# rke2
Hello folks. I've encountered an issue I seem to have resolved, but I don't quite understand why it got resolved or why it happened in the first place. Any insight would help me.

Yesterday we were seeing an issue where every 40-60 minutes, my single-node (yes, it's embarrassing) Kubernetes cluster would die. This happened at that interval for 8 hours. Long story short, I was able to find messages in dmesg corresponding to the outages, along the lines of "Memory cgroup out of memory: killed process blah". The processes being killed seemed to be Python and Ansible playbooks spawned by AWX (this cluster runs AWX, the open-source version of Ansible Automation Platform). I kept looking at the box, and it had 25 GB of memory free out of 48 GB, so it wasn't a question of OS resource contention.

I finally landed on the awx-operator deployment. It had been stuck in the 90% memory limit utilization range. I killed it, it came back at 5% memory limit utilization, and it has stayed there. Ever since then the rke2 cluster has been stable, there have been no additional OOM kills in dmesg, and no outages. Looking further into dmesg, I see messages about a memory usage of 983040kB and a limit of 983040kB; this corresponds to the memory limit of awx-operator.

Now, maybe I'm missing something or my interpretation is wrong, but it seemed like awx-operator was complaining about running out of memory, informing Kubernetes of it, and Kubernetes responded by killing other containers rather than killing the awx-operator container. It's my understanding that if a container hits its memory limit, Kubernetes kills THAT container, and not other ones. I looked, and awx-operator has priority: 0, there are no resource quotas on my namespace, and pod usage is not close to approaching node limits. I'm running rke2 v1.29.15. So maybe my understanding is wrong, or I arrived at the wrong conclusion as to what my root cause was, but I am baffled, and if anyone has any experience with this and can help me fill in a knowledge gap, you'll be a good person :)
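For reference, here is a minimal sketch of lining the kernel's view up against the pod's configured limit. The `awx` namespace and the `control-plane=controller-manager` label are assumptions from a default awx-operator install; adjust for your deployment. Note that 983040kB in the kernel message is 983040 / 1024 = 960 MiB, i.e. exactly the pod's memory limit.

```sh
# Kernel side: cgroup OOM kills land in the kernel ring buffer.
dmesg -T | grep -i "memory cgroup out of memory"

# Cluster side: current usage vs. the configured limit
# (kubectl top needs metrics-server; label and namespace are assumptions).
kubectl -n awx top pod -l control-plane=controller-manager
kubectl -n awx get pod -l control-plane=controller-manager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.memory}{"\n"}{end}'
```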
> but it seemed like awx-operator was complaining about running out of memory, informing Kubernetes of it, and Kubernetes responded by killing other containers rather than killing the awx-operator container.
that… is not how that works.
The kernel itself controls memory allocation based on cgroup limits. If the cgroup exceeds its limit, the kernel's OOM killer kills a process inside that cgroup. This is the same OOM-killer logic that is used if the node as a whole (the root cgroup) runs out of memory.
That is why you see the OOM messages in the kernel message log. It's the kernel that does the killing when something uses more memory than it is allowed.
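A quick way to see this from the node itself, as a sketch assuming cgroup v2 (the exact slice path varies by distro and container runtime, so the path below is a placeholder):

```sh
# Placeholder path: find the pod's slice under /sys/fs/cgroup/kubepods.slice/.
POD_CGROUP=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/<pod-slice>

cat "$POD_CGROUP/memory.max"      # hard limit the kernel enforces (bytes)
cat "$POD_CGROUP/memory.current"  # current usage (bytes)
cat "$POD_CGROUP/memory.events"   # the oom_kill counter increments on each kill
```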
Thanks for the reply and helping educate my clown ass. So that's the gap I'm missing in my understanding. If one container hits its memory limit... what should be getting killed by the kernel?
a process in that container
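To make the difference concrete, here is a rough way to tell the two cases apart (pod name and namespace are placeholders): if the container's main process was the victim, the kubelet restarts the container and records OOMKilled; if only a child process was killed, the container keeps running and the only trace is in the kernel log.

```sh
# Case 1: the container's PID 1 was killed -> restartCount bumps and
# lastState.terminated.reason says OOMKilled.
kubectl -n awx get pod <awx-operator-pod> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# Case 2: only a child process was killed -> no restart; check the node's kernel log.
dmesg -T | grep -i "out of memory"
```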
Are you running AWX in Kubernetes? Is it a pod in the RKE2 cluster?
Also, do you have CPU limits on that pod, or only memory limits?
Yeah, there's an awx-operator deployment in the awx namespace, and an AWX custom resource in the same namespace. The pod has the default limits for CPU and memory: CPU is 1500m, memory is 960Mi.
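If that default limit is what keeps getting hit, one option is to raise it. A hedged sketch follows; the deployment name `awx-operator-controller-manager` and container name `awx-manager` are assumptions from a default install, and if the operator is managed by kustomize or Helm you would change the limit there instead so it doesn't get reverted.

```sh
# Strategic merge patch: bump the operator container's memory limit.
kubectl -n awx patch deployment awx-operator-controller-manager -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"awx-manager","resources":{"limits":{"cpu":"1500m","memory":"1536Mi"}}}]}}}}'
```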
What exactly do you mean by the RKE2 cluster “dying”?
You helped clarify part of it for me, because the OOM-killed processes on the box seemed to be python3 and ansible-playbook processes. I had assumed those were from AWX job launches, but those are also part of what runs under the awx-operator. Re: dying: my connection to the cluster via k9s terminated during several (but not all) of the outages, and I noticed certain kube-system pods like canal and kube-proxy being restarted. These coincided with the OOM kills, but it did not happen every time.
I probably would have investigated why those were getting restarted. I don't know how your AWX deployment is set up, but if it ends up thrashing the disk because it is running out of memory and getting restarted all the time, that can affect the performance of etcd (if they share a disk) and then cause greater system instability.
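A rough sketch of what checking for that might look like on an rke2 node. The static etcd pod is typically named `etcd-<node-name>` in kube-system, but treat that name as an assumption:

```sh
# Slow-disk symptoms usually show up as warnings in the etcd log.
kubectl -n kube-system logs etcd-<node-name> | grep -Ei "took too long|slow"

# And as sustained I/O pressure on the node while AWX jobs run (needs sysstat).
iostat -x 1
```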
One of the job failures from an AWX job launch did have the error "receptor detail: error creating pod: etcdserver: request timed out", so that does make sense. The thing about it is that the awx-operator was stuck at 90%+ memory utilization all day yesterday. It wasn't until I got desperate and out of ideas that I killed it, and when it came back the utilization was back down to 5% of the limit. I guess your earlier explanation that the kernel just kills PIDs inside the container explains why it didn't "restart" the container like I'd assumed it would. Appreciate it. Thanks for the insights!
You might consider putting your workload or the etcd datastore on a different EBS volume.
That's an excellent idea. Is there documentation on how one might be able to do such a migration or would it have to be a new build?
Would it basically amount to creating a dedicated /var/lib/rancher/rke2/server/db/etcd volume and doing a migration while rke2 is offline?
I would do all of /var/lib/rancher/rke2/server/db but yeah
Just run rke2-killall.sh to make sure everything is stopped, copy the data over to the new disk, then remount it at the correct path.
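Spelled out, the migration might look roughly like this on the node. Device name, filesystem, and staging mount point are assumptions; take an etcd snapshot or other backup before touching anything.

```sh
rke2-killall.sh                                   # stop rke2 and everything it spawned

mkfs.ext4 /dev/nvme1n1                            # new EBS volume (device name is an assumption)
mkdir -p /mnt/newdb && mount /dev/nvme1n1 /mnt/newdb

rsync -aHAX /var/lib/rancher/rke2/server/db/ /mnt/newdb/      # copy the existing data
mv /var/lib/rancher/rke2/server/db /var/lib/rancher/rke2/server/db.bak
mkdir /var/lib/rancher/rke2/server/db

echo '/dev/nvme1n1 /var/lib/rancher/rke2/server/db ext4 defaults,nofail 0 2' >> /etc/fstab
umount /mnt/newdb && mount /var/lib/rancher/rke2/server/db

systemctl start rke2-server                       # bring the cluster back up
```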
Ok. Thank you very much for helping me! You are a good person.