ambitious-island-4760
11/21/2024, 7:59 AM
Our cattle-cluster-agents keep restarting, so I hope the Rancher community can help.
We're running a hybrid infrastructure with on-prem (vSphere based) Kubernetes clusters and AKS clusters on Azure. All are on Kubernetes 1.30.4 or 1.30.5, and Rancher is on 2.9.x (the problem was also present on 2.8.x). What we're seeing on all our AKS nodes (9 clusters in total, about 75 nodes) is that after running for a while, mostly 2 to 4 weeks, our cattle-cluster-agents begin restarting.
In the events of these crashing pods, we see a huge number of entries like the one below:
Normal SandboxChanged 17m (x1170 over 5d4h) kubelet Pod sandbox changed, it will be killed and re-created.
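For reference, this is roughly how we pull the restart counts and the events (assuming the default cattle-system namespace and the standard app=cattle-cluster-agent label):
# restart counts of the agent pods
kubectl -n cattle-system get pods -l app=cattle-cluster-agent
# events for one agent pod (replace the pod name)
kubectl -n cattle-system get events --field-selector involvedObject.name=<cattle-cluster-agent-pod>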
Of course we did our due diligence and performed the usual troubleshooting. We ensured memory and CPU can't be an issue, we verified nothing is blocked by a firewall, we checked the containerd config (most notably the systemd cgroups), refreshed the containerd config, and checked the kubelet, kube-proxy, and containerd logs, etc.
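For completeness, these are the kind of checks we ran on the nodes (paths and unit names are the AKS defaults as far as I know; adjust if yours differ):
# verify containerd uses the systemd cgroup driver
grep -n 'SystemdCgroup' /etc/containerd/config.toml
# kubelet and containerd logs on the node
journalctl -u kubelet --since "2 hours ago"
journalctl -u containerd --since "2 hours ago"
# kube-proxy runs as a pod on AKS; the label may differ in your cluster
kubectl -n kube-system logs -l component=kube-proxy --tail=200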
This is from the rancher pod logs on the management cluster at the time of the cattle-cluster-agent crash:
2024/11/21 07:37:06 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
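(That line comes from tailing the Rancher server pods on the management cluster, roughly like this, assuming a standard Helm install with the app=rancher label:)
kubectl -n cattle-system logs -l app=rancher -f --tail=100 | grep remotedialer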
This is the tail from the cattle-cluster-agent right before it crashes:
W1121 11:27:24.673193 56 warnings.go:70] v1 ComponentStatus is deprecated in v1.19+
time="2024-11-21T11:56:57Z" level=warning msg="signal received: \"terminated\", canceling context..."
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=GroupMember workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=Group workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=UserAttribute workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down /v1, Kind=Secret workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down /v1, Kind=Secret workers"
time="2024-11-21T11:56:57Z" level=fatal msg="Embedded rancher failed to start: context canceled"
We cannot find out why this is happening. I'll attach a piece of the journalctl log that I believe captures one occurrence of such a restart.
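A time-windowed journalctl query on the affected node captures that window, along these lines (fill in the times around the restart):
journalctl -u containerd -u kubelet --since "<just before the restart>" --until "<just after the restart>"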
Updating the node image on the AKS nodes, or simply creating new AKS nodes and deleting the old ones, solves the problem, but only for about 2 to 4 weeks. Then the problem reoccurs and we start all over again.
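(The node image refresh we use as a workaround is the standard AKS node-image-only upgrade; resource group, cluster, and node pool names below are placeholders:)
az aks nodepool upgrade --resource-group <rg> --cluster-name <aks-cluster> --name <nodepool> --node-image-only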
Our on-prem nodes do not experience this behavior. Any tips or help would be greatly appreciated.