Hi all, we're having a problem that mainly seems to happen to our cattle-cluster-agents, so I hope the Rancher community can help. We're running a hybrid infrastructure with on-prem (vSphere-based) Kubernetes clusters and AKS clusters in Azure. All are on version 1.30.4 or 1.30.5. Rancher is version 2.9.x; the problem was also present in 2.8.x. What we're seeing on all our AKS nodes (9 clusters in total, about 75 nodes) is that after running for a while, mostly 2 to 4 weeks, our cattle-cluster-agents begin restarting. In the event logs of these crashing pods we see a huge number of events like the one below:
Normal   SandboxChanged  17m (x1170 over 5d4h)     kubelet  Pod sandbox changed, it will be killed and re-created.
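For reference, those events were pulled straight from the pod; the pod name below is a placeholder:
kubectl -n cattle-system describe pod <cattle-cluster-agent-pod>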
Of course we did our due diligence and performed the required troubleshooting. We made sure memory and CPU can't be an issue, ensured nothing is blocked by a firewall, checked the containerd config (most notably the systemd cgroups), refreshed the containerd config, and checked the kubelet, kube-proxy, and containerd logs, among other things.
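A rough sketch of what those node-level checks look like on our side (paths are the containerd defaults and the kube-proxy daemonset name is assumed, so adjust as needed):
# confirm SystemdCgroup = true for the runc runtime
sudo grep -n SystemdCgroup /etc/containerd/config.toml
# refresh the containerd config by restarting the service
sudo systemctl restart containerd
# node-level logs
sudo journalctl -u kubelet -u containerd --since "1 hour ago"
# kube-proxy runs as a daemonset on AKS
kubectl -n kube-system logs ds/kube-proxy --tail=200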
This is from the rancher pod logs on the Rancher manager at the time of the cattle-cluster-agent crash:
2024/11/21 07:37:06 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
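As part of the firewall check we also verify that the downstream cluster can still reach the Rancher server URL; a minimal probe from inside the affected cluster looks roughly like this (rancher.example.com stands in for our real Rancher URL):
kubectl -n cattle-system run rancher-ping --rm -it --restart=Never --image=curlimages/curl --command -- curl -sk https://rancher.example.com/ping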
This is the tail of the cattle-cluster-agent log right before it crashes:
W1121 11:27:24.673193      56 warnings.go:70] v1 ComponentStatus is deprecated in v1.19+
time="2024-11-21T11:56:57Z" level=warning msg="signal received: \"terminated\", canceling context..."
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=GroupMember workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=Group workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=UserAttribute workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down /v1, Kind=Secret workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down /v1, Kind=Secret workers"
time="2024-11-21T11:56:57Z" level=fatal msg="Embedded rancher failed to start: context canceled"
We cannot find why this is happening. I'll attach a piece of the journalctl log that I believe captures one occurrence of such a restart. Updating the node image on the AKS nodes, or simply creating new AKS nodes and deleting the old ones, solves the problem, but only for about 2 to 4 weeks; then it reoccurs and we start all over again. Our on-prem nodes do not experience this behavior. Any tips and help are greatly appreciated.
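In case it matters, "updating the node image" means the standard node-image-only upgrade; the resource group, cluster, and node pool names below are placeholders:
az aks nodepool upgrade --node-image-only --resource-group <rg> --cluster-name <cluster> --name <nodepool>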