kind-air-74358
04/04/2025, 10:58 AM
❯ kubectl get events
LAST SEEN   TYPE      REASON      OBJECT             MESSAGE
3m48s       Warning   SystemOOM   node/nf46q-wwpkl   (combined from similar events): System OOM encountered, victim process: tempo, pid: 1842943
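For reference, the kill should also be visible in the node's kernel log; assuming SSH access to the worker and a systemd-based distro, something like the following should show the matching "Out of memory: Killed process 1842943 (tempo)" entry (exact wording varies by kernel version):
❯ journalctl -k | grep -i "out of memory"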
Once this occurred, trying to get the logs from a pod running on this node fails with the following error:
❯ kubectl logs pod/pod-h7r9f -n ns
Error from server: Get "https://<node-ip>:10250/containerLogs/ns/pod-h7r9f/container": proxy error from 127.0.0.1:9345 while dialing <node-ip>:10250, code 502: 502 Bad Gateway
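Port 9345 here is the RKE2 supervisor: the apiserver reaches the kubelet on 10250 through a tunnel that each agent keeps open to the server, so the 502 points at that tunnel rather than at the kubelet itself. As a sanity check that the kubelet is still up, something like this from the control-plane node should work (it may answer 401 if anonymous auth is disabled, which still proves the kubelet is responding):
❯ curl -sk https://<node-ip>:10250/healthz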
The kube-apiserver reports the following error:
E0404 08:28:24.906688 1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &url.Error{Op:\"Get\", URL:\"https://<node-ip>:10250/containerLogs/ns/pod-h7r9f/container\", Err:(*errors.errorString)(0xc03d438df0)}: Get \"https://<node-ip>:10250/containerLogs/ns/pod-h7r9f/container\": proxy error from 127.0.0.1:9345 while dialing <node-ip>:10250, code 502: 502 Bad Gateway" logger="UnhandledError"
The rke2-agent on the worker node doesn't log any errors, but the rke2-server on the control plane does:
Apr 04 08:28:24 control-plane-p2lr9 rke2[506030]: time="2025-04-04T08:28:24Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:40172: failed to find Session for client nf46q-wwpkl"
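That message comes from the tunnel layer: the server no longer has a live session for agent nf46q-wwpkl even though the agent process is still running, which matches the symptom that only an agent restart (re-establishing the tunnel) fixes it. On the worker, the agent side of the tunnel can be inspected with something like the following (the exact log strings are an assumption; RKE2 handles this tunnel via its remotedialer library):
❯ journalctl -u rke2-agent | grep -iE 'remotedialer|proxy|websocket'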
Manually restarting the rke2-agent on the worker node resolved the issue, but we expected that once the OOM killer had removed the offending process from the worker node, the kubectl logs error would resolve itself, which is not happening.
N.b. we know that the original pod claiming too much memory is probably the root cause and that we should prevent it (and we will fix that), but I would expect the worker node itself to also recover.
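For the root-cause side, a memory limit on the tempo container would make the kernel OOM-kill just that container's cgroup instead of triggering a system-wide OOM; as a sketch (workload kind, name, namespace, and the values are placeholders for whatever the actual tempo deployment uses):
❯ kubectl set resources statefulset/tempo -n <namespace> --requests=memory=1Gi --limits=memory=2Gi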
Thanks in advance

curved-application-90990
04/09/2025, 6:13 PM

kind-air-74358
04/18/2025, 6:34 AM