# rke2
c
Are those helm install jobs actually still running? They should take no longer than a few seconds.
check the pod logs perhaps?
r
Yeah they are still running. The pod logs are just stuck
Copy code
[root@ip-10-1-0-24 manifests]# kubectl logs helm-install-rke2-snapshot-controller-crd-hwgrv -n kube-system
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
	echo "KUBERNETES_SERVICE_HOST is using IPv6"
	CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
	CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ '' != \t\r\u\e ]]
+ export HELM_HOST=127.0.0.1:44134
+ HELM_HOST=127.0.0.1:44134
+ tiller --listen=127.0.0.1:44134 --storage=secret
+ helm_v2 init --skip-refresh --client-only --stable-repo-url https://charts.helm.sh/stable/
[main] 2024/08/19 13:13:02 Starting Tiller v2.17.0 (tls=false)
[main] 2024/08/19 13:13:02 GRPC listening on 127.0.0.1:44134
[main] 2024/08/19 13:13:02 Probes listening on :44135
[main] 2024/08/19 13:13:02 Storage driver is Secret
[main] 2024/08/19 13:13:02 Max history per release is 0
c
are they both stuck at the same spot?
r
The validation-webhook is failing because it is not able to find the CRDs for snapshot.storage.k8s.io.
Copy code
[root@ip-10-1-0-24 manifests]# kubectl logs helm-install-rke2-snapshot-controller-q2d4z -n kube-system
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
	echo "KUBERNETES_SERVICE_HOST is using IPv6"
	CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"

....

+ [[ null =~ ^(|null)$ ]]
+ [[ null =~ ^(|null)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.cattle.systemDefaultRegistry=10.1.0.12 --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16 --set-string global.systemDefaultIngressClass=ingress-nginx --set-string global.systemDefaultRegistry=10.1.0.12 rke2-snapshot-controller /tmp/rke2-snapshot-controller.tgz
Error: INSTALLATION FAILED: execution error at (rke2-snapshot-controller/templates/validate-install-crd.yaml:13:7): Required CRDs are missing. Please install the corresponding CRD chart before installing this chart.
+ exit
It says it is exiting, but the pod is still running.
Copy code
helm-install-rke2-snapshot-controller-q2d4z                        1/1     Running       10 (8h ago)     8h
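For reference, a quick way to confirm whether the snapshot CRDs actually made it into the cluster (illustrative command; the CRD names assume the standard external-snapshotter set shipped by the rke2-snapshot-controller-crd chart):

```bash
# Lists the snapshot.storage.k8s.io CRDs if the CRD chart installed them;
# expect volumesnapshots, volumesnapshotclasses and volumesnapshotcontents.
kubectl get crd | grep snapshot.storage.k8s.io
```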
c
Try deleting that pod and see if it re-runs successfully. I’ve never seen it hang at that specific spot before.
something else odd going on with the node that pod is running on perhaps?
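A minimal sketch of the suggested retry, using the stuck pod name from the listing above (since the helm-install job has not completed, its controller should spawn a replacement pod):

```bash
# Delete the stuck helm-install pod so the job creates a fresh one.
kubectl delete pod helm-install-rke2-snapshot-controller-q2d4z -n kube-system
```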
r
Any pointers on where to look on the node?
Yeah, I have seen this a couple of times over the weekend. I have tried installing at least 10 clusters.
c
what distro are you installing on? Is anything else going on with the nodes at the time that job is trying to run?
r
I am running on AWS. Nothing else is running on the node. Just installation of RKE2.
c
That command should finish quickly, and then it should run a
helm_v2 ls
with a 30-second timeout, which should also finish very quickly as there should not be any legacy helm v2 charts.
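A rough sketch of the kind of legacy-release check being described (illustrative only, not the actual klipper-helm entrypoint; the flags and the 30-second value are assumptions based on the description above):

```bash
# Look for any legacy Helm v2 releases, but give up after 30 seconds so the
# job cannot sit forever waiting on tiller.
timeout 30 helm_v2 ls --all || echo "helm_v2 ls timed out or failed"
```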
does it stick there every time? or does it run successfully after you delete the pod so that it can run again?
running what on AWS? what kind of Linux?
r
No, it doesn’t stick every time. I have tried 10 installations, and 2 of them have gotten stuck so far. Once I delete the hung pods, new pods are spawned and the validation-webhook runs fine. But the old pods are still around.
Copy code
helm-install-rke2-snapshot-controller-4qlmp                        0/1     Completed     0             3m44s
helm-install-rke2-snapshot-controller-crd-hwgrv                    1/1     Terminating   0             8h
helm-install-rke2-snapshot-controller-crd-lnnpm                    0/1     Completed     0             28m
helm-install-rke2-snapshot-controller-q2d4z                        1/1     Terminating   10 (8h ago)   8h
helm-install-rke2-snapshot-validation-webhook-6j2xh                0/1     Completed     0             8h
rke2-snapshot-controller-5d7c74b69d-dkzhd                          1/1     Running       0             3m43s
rke2-snapshot-validation-webhook-85b76ccbb5-8rjf7                  1/1     Running       0             2m24s
Running on RHEL 8.10
c
it is weird that they are stuck terminating. I would check the containerd logs. It sounds kinda like something is interrupting them while they’re running, and they hang and can’t even exit cleanly.
r
How do I check those?
c
find containerd.log under /var/lib/rancher/rke2/agent
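To locate and skim that log on an RKE2 node (illustrative commands; the grep assumes the default containerd subdirectory, hence the find first):

```bash
# Locate the containerd log under the RKE2 agent directory...
find /var/lib/rancher/rke2/agent -name containerd.log
# ...then scan it for errors around the time the pod got stuck.
grep 'level=error' /var/lib/rancher/rke2/agent/containerd/containerd.log | tail -n 50
```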
r
seeing this repeatedly in the containerd.log
Copy code
time="2024-08-19T21:31:45.630374052Z" level=error msg="StopContainer for \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\" failed" error="rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\" to be killed: wait container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\": context deadline exceeded"
time="2024-08-19T21:31:45.630625216Z" level=info msg="StopPodSandbox for \"f0fabd6d25105deb8d2e912524259b2c3fecd213e1345f504d16036753cde127\""
time="2024-08-19T21:31:45.630921831Z" level=info msg="Kill container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\""
time="2024-08-19T21:33:45.631241025Z" level=error msg="StopPodSandbox for \"f0fabd6d25105deb8d2e912524259b2c3fecd213e1345f504d16036753cde127\" failed" error="rpc error: code = DeadlineExceeded desc = failed to stop container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\": an error occurs during waiting for container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\" to be killed: wait container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\": context deadline exceeded"
time="2024-08-19T21:33:46.527610803Z" level=info msg="StopContainer for \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\" with timeout 30 (s)"
time="2024-08-19T21:33:46.528069494Z" level=info msg="Skipping the sending of signal terminated to container \"b714e4f9d6bdcd881efd10c5d18642a778b80c9b9dfb29dc25d8d6d97da2d9c6\" because a prior stop with timeout>0 request already sent the signal"
c
that is odd. Does this node have selinux enabled? is there anything interesting in the audit log?
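Illustrative commands for answering those two questions on a RHEL node (not part of the conversation; ausearch requires the audit daemon to be running):

```bash
getenforce                   # Enforcing / Permissive / Disabled
sestatus                     # more detail on the SELinux configuration
ausearch -m avc -ts recent   # recent SELinux denials from /var/log/audit/audit.log
```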
r
No, SELinux is not enabled. Where can I check the audit log?
c
if selinux isn’t enabled then you probably won’t have one…
Check from the start of the log and see if anything else weird is in there. It is pretty unusual for containers to be unkillable. I don’t think I’ve seen those errors before.
r
Nothing funny going on there. It is just loading the plugins with no errors. Just this one error:
Copy code
time="2024-08-19T13:11:59.114185290Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin
 not initialized: failed to load cni config"
but then seeing errors like these
Copy code
time="2024-08-19T13:12:49.899610225Z" level=error msg="Failed to destroy network for sandbox \"36e67da4647c5d14e8e952b4fa74c3058649aefae5c2ce44df000375ee278c97\"" error="plugin type=\"calico\" failed (delete): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/"
time="2024-08-19T13:12:49.917235237Z" level=error msg="encountered an error cleaning up failed sandbox \"36e67da4647c5d14e8e952b4fa74c3058649aefae5c2ce44df000375ee278c97\", marking sandbox state as SANDBOX_UNKNOWN" error="plugin type=\"calico\" failed (delete): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/"
time="2024-08-19T13:12:49.917314329Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:rke2-coredns-rke2-coredns-autoscaler-54b6894b8b-x7n2r,Uid:1c965b4c-bf30-4713-859d-2075752dc2cd,Namespace:kube-system,Attempt:0,} failed, err
or" error="failed to setup network for sandbox \"36e67da4647c5d14e8e952b4fa74c3058649aefae5c2ce44df000375ee278c97\": plugin type=\"calico\" failed (add): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node
 container is running and has mounted /var/lib/calico/"
time="2024-08-19T13:12:49.990167329Z" level=error msg="Failed to destroy network for sandbox \"afcdccd407bd8177718ea0b5d05866a7bace598e83183499e046b88e0333dc68\"" error="plugin type=\"calico\" failed (delete): stat /var/lib/calico/nodenam
e: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/"
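The error message itself points at the next things to check; a couple of illustrative commands (pod labels and namespaces vary by CNI packaging, so the grep is deliberately loose):

```bash
ls -l /var/lib/calico/nodename                            # written by calico/node once it has started on this host
kubectl get pods -A -o wide | grep -i -e calico -e canal  # is the CNI pod actually running on this node?
```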