# k3s
f
So far, I've really only thought to look at the kernel log message:
kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-2ffcdeac7a102eb3f6f49b43ae4afa36589bb9765272cf953fe7ee15a3a3cc67.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf27c27f9_1a8b_40bf_80a0_339b93de7be2.slice/cri-containerd-2ffcdeac7a102eb3f6f49b43ae4afa36589bb9765272cf953fe7ee15a3a3cc67.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf27c27f9_1a8b_40bf_80a0_339b93de7be2.slice/cri-containerd-2ffcdeac7a102eb3f6f49b43ae4afa36589bb9765272cf953fe7ee15a3a3cc67.scope,task=PROCESS,pid=3624942,uid=0
And then look at status.containerStatuses[].containerID to see if I find the 2ffcd... in there somewhere, but it's a somewhat laborious process
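A rough sketch of automating that lookup, assuming kubectl is on PATH and is allowed to list pods in all namespaces; the truncated ID from the kernel log is passed in as an argument rather than hard-coded:

```python
#!/usr/bin/env python3
"""Sketch: map a containerd ID from an oom-kill log line back to its pod.

Assumes kubectl is on PATH and can list pods cluster-wide.
"""
import json
import subprocess
import sys


def find_pod_by_container_id(container_id: str):
    out = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for pod in json.loads(out)["items"]:
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # status.containerStatuses[].containerID looks like
            # "containerd://2ffcdeac7a10..."
            if container_id in cs.get("containerID", ""):
                return (pod["metadata"]["namespace"],
                        pod["metadata"]["name"],
                        cs["name"])
    return None


if __name__ == "__main__":
    target = sys.argv[1]  # e.g. the 2ffcdeac... ID from the oom-kill line
    match = find_pod_by_container_id(target)
    print(match or "no pod found for that container ID")
```

The oom_memcg path in the log also appears to embed the pod UID (with dashes replaced by underscores in the ...-pod<uid>.slice segment), which could be matched against metadata.uid as an alternative.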
c
don’t you get a kubernetes event for the pod when a container is restarted?
f
The container isn't restarted. The root process is sitting there happy-as-can-be, but one of its subprocesses borked
So, the container/pod never died
c
ah I see. in that case it's the responsibility of pid 1 in the container to do something about it. The container runtime doesn't get the signal indicating that a child process has exited for anything other than pid 1.
are you running a process supervisor or something in that container? It’s considered somewhat of an anti-pattern to run multiple processes in a single container, for this very reason. Makes it hard for the container runtime and kubelet to handle process termination events.
f
Yup, I get the idea, but at the same time, when workloads wind up on our cluster that don't behave that way, I'd like to track down where those errors came from. I'm not necessarily the workload author
In this case, it's a python process that's using multiprocessing.Process()s to get some parallelism
I've actually fixed this specific workload, but I'm looking for a more general solution
IE: What happens when someone rolls out a workload that doesn't abide by these best practices, how can I figure out which of the 500 pods it is?
This time, I lucked out
c
there are a couple moving pieces here
1. the kubelet and container runtime try to handle memory limits themselves so that the kernel OOM killer doesn't get involved. If you have memory limits on the pods, kubernetes tries to kill the container when it hits the limit so that it can handle restarting it properly.
2. if the kernel does get involved, the kubelet watches the kernel message log (dmesg) so that it can see what process got OOM killed, and tries to handle restarting it
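For the cases the kubelet does handle (point 1 above), the outcome is visible in pod status as a container termination with reason OOMKilled. A minimal sketch for listing those, again assuming kubectl is on PATH with cluster-wide list permission; note it will not catch the child-process case being discussed here:

```python
#!/usr/bin/env python3
"""Sketch: list containers whose last recorded termination was OOMKilled.

Assumes kubectl is on PATH and can list pods in all namespaces.
"""
import json
import subprocess

out = subprocess.run(
    ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout

for pod in json.loads(out)["items"]:
    for cs in pod.get("status", {}).get("containerStatuses", []):
        last = cs.get("lastState", {}).get("terminated", {})
        if last.get("reason") == "OOMKilled":
            print(pod["metadata"]["namespace"],
                  pod["metadata"]["name"],
                  cs["name"],
                  last.get("finishedAt"))
```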
f
Oh interesting, I figured the kernel OOM killer did the bulk of the work
c
Does the pod in question have memory limits set? Does the node have swap enabled?
f
The pod has memory limits and the nodes do not have swap enabled
c
is the pod as a whole exceeding its memory limit when that process gets OOM killed?
f
You know, I'm not entirely sure, but I believe it's probably the pod total going over, and then one unlucky process gets taken out. The node itself has gobs of memory free
The workload itself is limited to like 2GB of memory, and the node has 128GB of ram and sits at like 20-30% utilization
c
I think that’s kinda how it works. If the OOM killer gets to the pod first, and picks a process that isn’t pid 1, it’s not visible to the container runtime.
The app itself needs to be responsible for handling that, whether it’s restarting the child process, or exiting out so the container can be restarted.
f
Makes sense; that particular pod has been adjusted to do just that -- if a worker disappears, it'll shut down.
c
If your main process is forking, it should either fail when a critical child fails or restart the child itself, or you need to use a liveness probe.
It's expected that child processes might exit for various reasons, and it would be a breaking change to restart pods when a child process exits.
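A minimal sketch of that pattern with Python multiprocessing, where do_work and the worker count are placeholders for the real workload, not the actual app:

```python
#!/usr/bin/env python3
"""Sketch: a pid-1 style parent that exits if any worker child dies,
so the kubelet sees the container fail and can restart it.
do_work stands in for the real workload."""
import multiprocessing as mp
import sys
import time


def do_work(i: int) -> None:
    while True:
        time.sleep(60)  # placeholder for real work


def main() -> None:
    # daemon=True so remaining children are torn down when the parent exits
    workers = [mp.Process(target=do_work, args=(i,), daemon=True)
               for i in range(4)]
    for w in workers:
        w.start()

    # Watchdog loop: if any child is gone (e.g. OOM-killed, exitcode -9),
    # exit non-zero so the container terminates and Kubernetes restarts it.
    while True:
        for w in workers:
            if not w.is_alive():
                print(f"worker {w.pid} exited with {w.exitcode}; shutting down",
                      file=sys.stderr)
                sys.exit(1)
        time.sleep(5)


if __name__ == "__main__":
    main()
```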
f
Makes sense now. I still wish there was a half-decent way to track this down. I'm combing through logs to see if I can track the containerd-id back to pod information somehow
c
yeah, tracking the cgroup ID back to a container/pod isn’t easy without poking at the internals
f
I see the container-id in the pod status information. I'm wondering if that's getting stored in a log somewhere that gets bubbled up to our logging system
So, a question on best practices: we've been led to believe it's best to set memory requests and limits to the same value, set CPU requests, and generally leave CPU limits empty. Is this still a good general practice?
I'm trying to think of scenarios where letting the memory request and limit differ might make sense, and/or where setting a CPU limit might make sense
c
people have feelings about that. I’m not sure there’s a single “right way” to do it, what works best depends on how you want things scheduled and limits enforced.
f
Sure thing, curious if there are any resources on when someone would choose one approach over another
I'm mostly curious about competing opinions/approaches to it. At any rate, I think I have the best answer I can for now and I learned some things about kubelet
a
Maybe metrics can help? node_exporter
node_vmstat_oom_kill
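A sketch of querying that metric through the Prometheus HTTP API, assuming node_exporter is scraped and Prometheus is reachable at the (hypothetical) address below; it narrows things down to a node, not a pod:

```python
#!/usr/bin/env python3
"""Sketch: ask Prometheus which nodes saw OOM kills in the last hour.

PROM_URL is an assumption; point it at wherever Prometheus is reachable.
"""
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # hypothetical address
QUERY = "increase(node_vmstat_oom_kill[1h]) > 0"

resp = requests.get(f"{PROM_URL}/api/v1/query",
                    params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    instance = result["metric"].get("instance", "unknown")
    count = float(result["value"][1])
    print(f"{instance}: ~{count:.0f} OOM kills in the last hour")
```

Pairing a node-level alert like this with the kernel-log/containerID lookup above is one way to get from "some node had an OOM kill" to the specific pod.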
c
@future-fountain-82544 possibly helpful for your use case: https://github.com/kubernetes/kubernetes/pull/117793
f
Ooh ty