# rke2
k
Hello, does anyone have any advice on the following issue? We are running an RKE2 cluster (managed via Rancher) which uses Cilium without kube-proxy. At some point a pod claims too much memory, it gets OOM-killed, and an event is raised:
❯ kubectl get events
LAST SEEN   TYPE      REASON                        OBJECT                                                         MESSAGE
3m48s       Warning   SystemOOM                     node/nf46q-wwpkl   (combined from similar events): System OOM encountered, victim process: tempo, pid: 1842943
Once this has occurred, trying to get the logs of a pod running on this node fails with the following error:
❯ kubectl logs pod/pod-h7r9f -n ns
Error from server: Get "https://<node-ip>:10250/containerLogs/ns/pod-h7r9f/container": proxy error from 127.0.0.1:9345 while dialing <node-ip>:10250, code 502: 502 Bad Gateway
The kube-apiserver reports the following error:
E0404 08:28:24.906688       1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &url.Error{Op:\"Get\", URL:\"https://<node-ip>:10250/containerLogs/ns/pod-h7r9f/container\", Err:(*errors.errorString)(0xc03d438df0)}: Get \"https://<node-ip>:10250/containerLogs/ns/pod-h7r9f/container\": proxy error from 127.0.0.1:9345 while dialing <node-ip>:10250, code 502: 502 Bad Gateway" logger="UnhandledError"
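The 9345 in both errors is the RKE2 supervisor port: the kubectl logs request goes apiserver → supervisor tunnel on 127.0.0.1:9345 → kubelet on <node-ip>:10250, so the 502 seems to come from the tunnel rather than from the kubelet itself. A quick way to confirm the kubelet is still reachable directly is to hit its port from a control-plane node (without a client certificate it will probably just answer 401/403, but even that proves the port is alive):
❯ curl -sk -o /dev/null -w "%{http_code}\n" https://<node-ip>:10250/healthz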
The rke2-agent on the worker node doesn't report any errors, but the rke2-server on the control plane does:
Apr 04 08:28:24 control-plane-p2lr9 rke2[506030]: time="2025-04-04T08:28:24Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:40172: failed to find Session for client nf46q-wwpkl"
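That "failed to find Session for client" message suggests the websocket tunnel between the rke2-agent and the supervisor on port 9345 has dropped. A rough way to check from the worker node (the ports are the RKE2 defaults, the grep pattern is only a guess at relevant keywords):
❯ systemctl status rke2-agent
❯ journalctl -u rke2-agent --since "1 hour ago" | grep -iE "tunnel|websocket|remotedialer"
❯ ss -tnp | grep 9345     # is there still an established connection to the supervisor?
❯ ss -tlnp | grep 10250   # is the kubelet still listening locally?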
Manually restarting the rke2-agent on the worker node resolves the issue, but we expected that once the OOM killer had removed the pod from the worker node, the kubectl logs error would clear on its own, which is not happening. N.B. we know that the original pod claiming too much memory is probably the root cause (and we will fix that), but I would expect the worker node itself to recover as well. Thanks in advance
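For completeness, a minimal sketch of the workaround and the planned fix; the hostname, workload name, kind and memory values below are placeholders, not our actual configuration:
❯ ssh worker-node 'sudo systemctl restart rke2-agent'   # workaround: re-establish the agent tunnel
❯ kubectl -n tempo set resources statefulset/tempo --requests=memory=1Gi --limits=memory=2Gi
With a memory limit in place the container gets OOM-killed inside its own cgroup and restarted, instead of triggering a system-wide OOM on the node.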
c
Unfortunately it does not recover from that. I have seen this a few times over the last few days. I will dig further into this.
k
@curved-application-90990, thanks for your reply. Did you get any further with this?