Hello folks. I've run into an issue that I seem to have resolved, but I don't understand why the fix worked or why it happened in the first place. Any insight would help me.
Yesterday we were seeing an issue where, every 40-60 minutes, my single-node (yes, it's embarrassing) Kubernetes cluster would die. This went on at that interval for 8 hours. Long story short, I found messages in dmesg that lined up with the outages, along the lines of "Memory cgroup out of memory: killed process blah". The processes being killed seemed to be Python and Ansible playbooks spawned by AWX (this cluster runs AWX, which is the open source version of Ansible Automation Platform). I kept looking at the box and it had 25 GB of memory free out of 48 GB, so it wasn't a question of OS-level resource contention.
I finally landed on the awx-operator deployment. It had been sitting at around 90% of its memory limit. I killed it, and it came back at 5% of its memory limit and has stayed there. Ever since then, the rke2 cluster has been stable, with no additional OOM kills in dmesg and no outages. Looking further into dmesg, I see messages reporting memory usage of 983040kB against a limit of 983040kB. This corresponds to the memory limit of awx-operator.
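For anyone who wants to sanity-check that number, here is a minimal Python sketch of the arithmetic. It assumes the dmesg line has the usual "memory: usage ..., limit ..." shape and that the kernel's "kB" means 1024 bytes. The sample line is reconstructed for illustration (the failcnt value is a placeholder); only the two 983040kB figures come from the post.

```python
import re

# Reconstructed sample line; failcnt is a placeholder value.
# Only the two 983040kB figures are taken from the post above.
SAMPLE = "memory: usage 983040kB, limit 983040kB, failcnt 0"


def parse_usage_limit(line):
    """Return (usage_kib, limit_kib) from a memcg report line, or None if absent."""
    m = re.search(r"usage (\d+)kB, limit (\d+)kB", line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def kib_to_mib(kib):
    """Convert kernel 'kB' (1024-byte units) to MiB."""
    return kib / 1024


usage_kib, limit_kib = parse_usage_limit(SAMPLE)
print(
    f"usage={kib_to_mib(usage_kib):.0f}MiB "
    f"limit={kib_to_mib(limit_kib):.0f}MiB "
    f"at_limit={usage_kib >= limit_kib}"
)
# prints: usage=960MiB limit=960MiB at_limit=True
```

So 983040kB works out to exactly 960 MiB, which is what a container memory limit written as 960Mi would look like from the kernel's side.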
Now, maybe I'm missing something or my interpretation is wrong, but it seemed like awx-operator was running out of memory, informing Kubernetes of it, and Kubernetes responded by killing other containers rather than the awx-operator container. It's my understanding that if a container hits its memory limit, Kubernetes kills THAT container, and not other ones.
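One way to test that understanding against the logs: on reasonably recent kernels, each memcg OOM report includes an "oom-kill:" summary line with an oom_memcg field (the cgroup whose limit was hit) and a task_memcg field (the cgroup the killed process lived in). The sketch below compares the two. The sample line is made up, with placeholder cgroup paths, task name, pid and uid, and the exact field layout may differ by kernel version, so treat it as a starting point rather than a definitive parser.

```python
# Made-up line in the shape of the kernel's "oom-kill:" summary.
# All paths, the task name, pid and uid are placeholders.
SAMPLE = (
    "oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),"
    "cpuset=cri-containerd-aaa.scope,mems_allowed=0,"
    "oom_memcg=/kubepods.slice/pod-example-a,"
    "task_memcg=/kubepods.slice/pod-example-a/cri-containerd-aaa.scope,"
    "task=python3,pid=12345,uid=1000"
)


def parse_oom_kill(line):
    """Split the key=value pairs after 'oom-kill:' into a dict.

    Naive comma split: assumes no field value itself contains a comma.
    """
    _, _, rest = line.partition("oom-kill:")
    fields = {}
    for part in rest.split(","):
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def victim_inside_limited_cgroup(fields):
    """True if the killed task's cgroup is the limited cgroup or a child of it."""
    oom = fields["oom_memcg"].rstrip("/")
    task = fields["task_memcg"]
    return task == oom or task.startswith(oom + "/")


fields = parse_oom_kill(SAMPLE)
print(fields["task"], fields["oom_memcg"], victim_inside_limited_cgroup(fields))
```

If this returns True for the real lines, the killed Python/Ansible processes were charged to the same cgroup that hit its limit, and the question becomes which pod that cgroup path belongs to. If it returns False, something else is going on and the logs deserve a closer look.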
I checked: awx-operator has priority: 0, there are no resource quotas on my namespace, and pod usage is nowhere near the node's limits. I'm running rke2 v1.29.15.
So maybe my understanding is wrong, or I arrived at the wrong conclusion about the root cause, but I'm baffled. If anyone has experience with this and can help me fill in a knowledge gap, you'll be a good person :)