ambitious-island-4760
11/21/2024, 7:59 AM
Our cattle-cluster-agents keep restarting, so I hope the Rancher community can help.
We're running a hybrid infrastructure with on-prem (vSphere based) Kubernetes clusters and AKS clusters on Azure. All are on Kubernetes 1.30.4 or 1.30.5, and Rancher is on 2.9.x (the problem was also present on 2.8.x). What we're seeing on all our AKS nodes (9 clusters in total, about 75 nodes) is that after running for a while, mostly 2 to 4 weeks, our cattle-cluster-agents begin restarting.
In the events of these crashing pods, we see a huge number of entries like the one below:
Normal SandboxChanged 17m (x1170 over 5d4h) kubelet Pod sandbox changed, it will be killed and re-created.
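For reference, this is roughly how we pull the restart counts and the events (assuming the default cattle-system namespace and the standard app=cattle-cluster-agent label):
# restart counts of the agent pods
kubectl -n cattle-system get pods -l app=cattle-cluster-agent
# events for one agent pod (replace the pod name)
kubectl -n cattle-system get events --field-selector involvedObject.name=<cattle-cluster-agent-pod>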
Of course we did our due diligence and performed the usual troubleshooting. We ensured memory and CPU can't be an issue, we verified nothing is blocked by a firewall, we checked the containerd config (most notably the systemd cgroups), refreshed the containerd config, and checked the kubelet, kube-proxy, and containerd logs, etc.
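For completeness, these are the kind of checks we ran on the nodes (paths and unit names are the AKS defaults as far as I know; adjust if yours differ):
# verify containerd uses the systemd cgroup driver
grep -n 'SystemdCgroup' /etc/containerd/config.toml
# kubelet and containerd logs on the node
journalctl -u kubelet --since "2 hours ago"
journalctl -u containerd --since "2 hours ago"
# kube-proxy runs as a pod on AKS; the label may differ in your cluster
kubectl -n kube-system logs -l component=kube-proxy --tail=200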
This is from the rancher pod logs on the management cluster at the time of the cattle-cluster-agent crash:
2024/11/21 07:37:06 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
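(That line comes from tailing the Rancher server pods on the management cluster, roughly like this, assuming a standard Helm install with the app=rancher label:)
kubectl -n cattle-system logs -l app=rancher -f --tail=100 | grep remotedialer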
This is the tail from the cattle-cluster-agent right before it crashes:
W1121 11:27:24.673193 56 warnings.go:70] v1 ComponentStatus is deprecated in v1.19+
time="2024-11-21T11:56:57Z" level=warning msg="signal received: \"terminated\", canceling context..."
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=GroupMember workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=Group workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=UserAttribute workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down /v1, Kind=Secret workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down /v1, Kind=Secret workers"
time="2024-11-21T11:56:57Z" level=fatal msg="Embedded rancher failed to start: context canceled"
We cannot find out why this is happening. I'll attach a piece of the journalctl log that I believe captures one occurrence of such a restart.
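A time-windowed journalctl query on the affected node captures that window, along these lines (fill in the times around the restart):
journalctl -u containerd -u kubelet --since "<just before the restart>" --until "<just after the restart>"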
Updating the node image on the AKS nodes, or simply creating new AKS nodes and deleting the old ones, solves the problem, but only for about 2 to 4 weeks. Then the problem reoccurs and we start all over again.
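(The node image refresh we use as a workaround is the standard AKS node-image-only upgrade; resource group, cluster, and node pool names below are placeholders:)
az aks nodepool upgrade --resource-group <rg> --cluster-name <aks-cluster> --name <nodepool> --node-image-only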
Our on-prem nodes do not experience this behavior. Any tips or help would be greatly appreciated.