# rke2
c
Are those helm install jobs actually still running? They should take no longer than a few seconds.
check the pod logs perhaps?
r
Yeah they are still running. The pod logs are just stuck
Copy code
[root@ip-10-1-0-24 manifests]# kubectl logs helm-install-rke2-snapshot-controller-crd-hwgrv -n kube-system
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
	echo "KUBERNETES_SERVICE_HOST is using IPv6"
	CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
	CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ '' != \t\r\u\e ]]
+ export HELM_HOST=127.0.0.1:44134
+ HELM_HOST=127.0.0.1:44134
+ tiller --listen=127.0.0.1:44134 --storage=secret
+ helm_v2 init --skip-refresh --client-only --stable-repo-url https://charts.helm.sh/stable/
[main] 2024/08/19 13:13:02 Starting Tiller v2.17.0 (tls=false)
[main] 2024/08/19 13:13:02 GRPC listening on 127.0.0.1:44134
[main] 2024/08/19 13:13:02 Probes listening on :44135
[main] 2024/08/19 13:13:02 Storage driver is Secret
[main] 2024/08/19 13:13:02 Max history per release is 0
c
are they both stuck at the same spot?
r
The validation-webhook is failing because it is not able to find the CRDs for snapshot.storage.k8s.io.
Copy code
[root@ip-10-1-0-24 manifests]# kubectl logs helm-install-rke2-snapshot-controller-q2d4z -n kube-system
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
	echo "KUBERNETES_SERVICE_HOST is using IPv6"
	CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"

....

+ [[ null =~ ^(|null)$ ]]
+ [[ null =~ ^(|null)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.cattle.systemDefaultRegistry=10.1.0.12 --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16 --set-string global.systemDefaultIngressClass=ingress-nginx --set-string global.systemDefaultRegistry=10.1.0.12 rke2-snapshot-controller /tmp/rke2-snapshot-controller.tgz
Error: INSTALLATION FAILED: execution error at (rke2-snapshot-controller/templates/validate-install-crd.yaml:13:7): Required CRDs are missing. Please install the corresponding CRD chart before installing this chart.
+ exit
It says it is exiting, but the pod is still running.
Copy code
helm-install-rke2-snapshot-controller-q2d4z                        1/1     Running       10 (8h ago)     8h
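For reference, a quick way to confirm whether the snapshot CRDs actually made it into the cluster (illustrative command; the CRD names assume the standard external-snapshotter set shipped by the rke2-snapshot-controller-crd chart):

```bash
# Lists the snapshot.storage.k8s.io CRDs if the CRD chart installed them;
# expect volumesnapshots, volumesnapshotclasses and volumesnapshotcontents.
kubectl get crd | grep snapshot.storage.k8s.io
```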
c
Try deleting that pod and see if it re-runs successfully. I’ve never seen it hang at that specific spot before.
something else odd going on with the node that pod is running on perhaps?
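A minimal sketch of the suggested retry, using the stuck pod name from the listing above (since the helm-install job has not completed, its controller should spawn a replacement pod):

```bash
# Delete the stuck helm-install pod so the job creates a fresh one.
kubectl delete pod helm-install-rke2-snapshot-controller-q2d4z -n kube-system
```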
r
Any pointers on where to look on the node?
Yeah, I have seen this a couple of times over the weekend. I have tried installing at least 10 clusters.
c
what distro are you installing on? Is anything else going on with the nodes at the time that job is trying to run?
r
I am running on AWS. Nothing else is running on the node. Just installation of RKE2.
c
That command should finish quickly, and then it should run a
helm_v2 ls
with a 30-second timeout, which should also finish very quickly as there should not be any legacy helm v2 charts.
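A rough sketch of the kind of legacy-release check being described (illustrative only, not the actual klipper-helm entrypoint; the flags and the 30-second value are assumptions based on the description above):

```bash
# Look for any legacy Helm v2 releases, but give up after 30 seconds so the
# job cannot sit forever waiting on tiller.
timeout 30 helm_v2 ls --all || echo "helm_v2 ls timed out or failed"
```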
does it stick there every time? or does it run successfully after you delete the pod so that it can run again?
running what on AWS? what kind of Linux?
r
No, it doesn’t stick every time. I have tried 10 installations, and 2 of them have gotten stuck so far. Once I delete the hung pods, new pods are spawned and the validation-webhook runs fine. But the old pods are still around.
Copy code
helm-install-rke2-snapshot-controller-4qlmp                        0/1     Completed     0             3m44s
helm-install-rke2-snapshot-controller-crd-hwgrv                    1/1     Terminating   0             8h
helm-install-rke2-snapshot-controller-crd-lnnpm                    0/1     Completed     0             28m
helm-install-rke2-snapshot-controller-q2d4z                        1/1     Terminating   10 (8h ago)   8h
helm-install-rke2-snapshot-validation-webhook-6j2xh                0/1     Completed     0             8h
rke2-snapshot-controller-5d7c74b69d-dkzhd                          1/1     Running       0             3m43s
rke2-snapshot-validation-webhook-85b76ccbb5-8rjf7                  1/1     Running       0             2m24s
Running on RHEL 8.10
c
it is weird that they are stuck terminating. I would check the containerd logs. It sounds kinda like something is interrupting them while they’re running, and they hang and can’t even exit cleanly.
r
How do I check those?
c
find containerd.log under /var/lib/rancher/rke2/agent
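To locate and skim that log on an RKE2 node (illustrative commands; the grep assumes the default containerd subdirectory, hence the find first):

```bash
# Locate the containerd log under the RKE2 agent directory...
find /var/lib/rancher/rke2/agent -name containerd.log
# ...then scan it for errors around the time the pod got stuck.
grep 'level=error' /var/lib/rancher/rke2/agent/containerd/containerd.log | tail -n 50
```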
r
seeing this repeatedly in the containerd.log
Copy code
time="2024-08-19T21:31:45.630374052Z" level=error msg="StopContainer for \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\" failed" error="rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\" to be killed: wait container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\": context deadline exceeded"
time="2024-08-19T21:31:45.630625216Z" level=info msg="StopPodSandbox for \"f0fabd6d25105deb8d2e912524259b2c3fecd213e1345f504d16036753cde127\""
time="2024-08-19T21:31:45.630921831Z" level=info msg="Kill container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\""
time="2024-08-19T21:33:45.631241025Z" level=error msg="StopPodSandbox for \"f0fabd6d25105deb8d2e912524259b2c3fecd213e1345f504d16036753cde127\" failed" error="rpc error: code = DeadlineExceeded desc = failed to stop container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\": an error occurs during waiting for container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\" to be killed: wait container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\": context deadline exceeded"
time="2024-08-19T21:33:46.527610803Z" level=info msg="StopContainer for \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\" with timeout 30 (s)"
time="2024-08-19T21:33:46.528069494Z" level=info msg="Skipping the sending of signal terminated to container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\" because a prior stop with timeout>0 request already sent the signal"
c
that is odd. Does this node have selinux enabled? is there anything interesting in the audit log?
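Illustrative commands for answering those two questions on a RHEL node (not part of the conversation; ausearch requires the audit daemon to be running):

```bash
getenforce                   # Enforcing / Permissive / Disabled
sestatus                     # more detail on the SELinux configuration
ausearch -m avc -ts recent   # recent SELinux denials from /var/log/audit/audit.log
```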
r
No, SELinux is not enabled. Where can I check the audit log?
c
if selinux isn’t enabled then you probably won’t have one…
Check from the start of the log and see if anything else weird is in there. It is pretty unusual for containers to be unkillable. I don’t think I’ve seen those errors before.
r
Nothing funny going on there. It is just loading the plugins with no errors. Just this one error:
Copy code
time="2024-08-19T13:11:59.114185290Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin
 not initialized: failed to load cni config"
but then seeing errors like these
Copy code
time="2024-08-19T13:12:49.899610225Z" level=error msg="Failed to destroy network for sandbox \"36e67da4647c5d14e8e952b4fa74c3058649aefae5c2ce44df000375ee278c97\"" error="plugin type=\"calico\" failed (delete): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/"
time="2024-08-19T13:12:49.917235237Z" level=error msg="encountered an error cleaning up failed sandbox \"36e67da4647c5d14e8e952b4fa74c3058649aefae5c2ce44df000375ee278c97\", marking sandbox state as SANDBOX_UNKNOWN" error="plugin type=\"calico\" failed (delete): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/"
time="2024-08-19T13:12:49.917314329Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:rke2-coredns-rke2-coredns-autoscaler-54b6894b8b-x7n2r,Uid:1c965b4c-bf30-4713-859d-2075752dc2cd,Namespace:kube-system,Attempt:0,} failed, err
or" error="failed to setup network for sandbox \"36e67da4647c5d14e8e952b4fa74c3058649aefae5c2ce44df000375ee278c97\": plugin type=\"calico\" failed (add): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node
 container is running and has mounted /var/lib/calico/"
time="2024-08-19T13:12:49.990167329Z" level=error msg="Failed to destroy network for sandbox \"afcdccd407bd8177718ea0b5d05866a7bace598e83183499e046b88e0333dc68\"" error="plugin type=\"calico\" failed (delete): stat /var/lib/calico/nodenam
e: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/"
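The error message itself points at the next things to check; a couple of illustrative commands (pod labels and namespaces vary by CNI packaging, so the grep is deliberately loose):

```bash
ls -l /var/lib/calico/nodename                            # written by calico/node once it has started on this host
kubectl get pods -A -o wide | grep -i -e calico -e canal  # is the CNI pod actually running on this node?
```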