# rke2
but it seemed like awx-operator was complaining about running out of memory, informing Kubernetes, and Kubernetes responded by killing other containers rather than the awx-operator container itself.
that… is not how that works.
The kernel itself enforces memory allocation based on cgroup limits. If a cgroup exceeds its limit, the kernel's OOM killer kills a process inside that cgroup. This is the same OOM-killer logic that runs when the node as a whole (the root cgroup) runs out of memory.
That's why you see the OOM messages in the kernel message log. It's the kernel that does the killing when things use more memory than they are allowed.
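For reference, a rough way to see both sides of that on the node, assuming cgroup v2 and the cgroup layout RKE2 nodes typically use (on cgroup v1 the limit file is `memory.limit_in_bytes` instead):

```sh
# Kernel-side view: every OOM kill is logged by the kernel
dmesg -T | grep -iE 'oom-killer|killed process'

# Container-side view: the memory limit Kubernetes wrote into each pod's cgroup
# (paths vary by cgroup driver; this just prints whatever memory.max files exist)
find /sys/fs/cgroup/kubepods.slice -name memory.max -exec grep -H . {} \; | head
```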
Thanks for the reply and helping educate my clown ass. So that's the gap I'm missing in my understanding. If one container hits its memory limit... what should be getting killed by the kernel?
a process in that container
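A rough way to see which process got picked, and what Kubernetes records on the container afterwards (the pod name below is a placeholder):

```sh
# The kernel log names the victim PID and the cgroup it belonged to
journalctl -k | grep -iE 'memory cgroup out of memory|killed process'

# If the victim was the container's main process, Kubernetes marks the container OOMKilled;
# if it was some other process inside, the container can keep running at high utilization
kubectl -n awx get pod <awx-pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```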
Are you running AWX in Kubernetes? Is it a pod in the RKE2 cluster?
Also, do you have CPU limits on that pod, or only memory limits?
Yeah, there's an awx-operator in the awx namespace and an AWX custom resource in the same namespace. The pod has the default limits for CPU and memory: CPU is 1500m, memory is 960Mi.
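For reference, a quick way to confirm what limits actually landed on the pods (the pod name is a placeholder; the jsonpath output is kubectl's raw map rendering):

```sh
# Memory/CPU limits for every container in the awx namespace
kubectl -n awx get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.containers[*]}  {.name}: {.resources.limits}{"\n"}{end}{end}'

# Human-readable version for a single pod
kubectl -n awx describe pod <awx-operator-pod> | grep -A3 -iE 'limits:|requests:'
```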
What exactly do you mean by the RKE2 cluster “dying”?
You helped clarify part of it for me, because the OOM-killed processes on the box seemed to be python3 and ansible-playbook processes. I had assumed those were from AWX job launches, but those are also part of what runs in the awx-operator. Re: dying: my connection to the cluster via k9s dropped several times (though not every time), and I noticed certain kube-system pods like canal and kube-proxy being restarted. These coincided with the OOM kills, but it didn't happen every time.
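One rough way to check whether those restarts actually line up with the OOM kills is to compare the restart events against the kernel log timestamps:

```sh
# Restart counts for the kube-system pods, plus recent events around kills/back-offs
kubectl -n kube-system get pods
kubectl -n kube-system get events --sort-by=.lastTimestamp | grep -iE 'killing|back-off|oom'

# Kernel OOM kills with timestamps, to correlate against the above
journalctl -k --since "24 hours ago" | grep -i 'out of memory'
```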
I probably would have investigated why those were getting restarted. idk how your AWX deployment is set up, but if it ends up thrashing the disk because it is running out of memory and getting restarted all the time, that can affect the performance of etcd (if they share disk) and then cause greater system instability.
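If you want to confirm the etcd side of that, its logs are fairly vocal about slow disk. Something along these lines, given that RKE2 runs etcd as a static pod named after the server node:

```sh
# Find the etcd static pod, then grep its logs for slow-disk warnings
kubectl -n kube-system get pods | grep etcd
kubectl -n kube-system logs etcd-<server-node-name> --tail=2000 | grep -iE 'slow fdatasync|took too long'

# Raw disk latency/utilization on the node itself, watched live (iostat is in the sysstat package)
iostat -x 5
```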
One of the job failures from an AWX job launch did have the error "receptor detail: error creating pod: etcdserver: request timed out", so that does make sense. The thing is, the awx-operator was stuck at 90%+ memory utilization all day yesterday. It wasn't until I got desperate and ran out of ideas that I killed it, and when it came back it was down at 5% of the limit. I guess your earlier explanation, that the kernel just kills PIDs inside the container, explains why it didn't restart the container like I'd assumed it would. Appreciate it. Thanks for the insights!
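For next time, watching usage against the limit makes a wedged-but-not-restarted container easier to spot. A minimal sketch, assuming the metrics-server RKE2 bundles is running:

```sh
# Current memory usage per pod in the awx namespace; compare against the 960Mi limit
kubectl -n awx top pods

# Per-container breakdown, to see which container inside the pod is the one climbing
kubectl -n awx top pods --containers
```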
You might consider putting your workload or the etcd datastore on a different EBS volume
That's an excellent idea. Is there documentation on how one might be able to do such a migration or would it have to be a new build?
Would it basically amount to creating a dedicated /var/lib/rancher/rke2/server/db/etcd volume and doing a migration while rke2 is offline?
I would do all of /var/lib/rancher/rke2/server/db but yeah
👍 1
just run rke2-killall.sh to make sure everything is stopped, copy everything over to the new disk, then mount it at the correct path
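A rough sketch of what that might look like on the server node, assuming the new EBS volume is already attached and formatted and shows up as /dev/nvme1n1 (device name and temporary mount point are placeholders):

```sh
# 1. Stop RKE2 and everything it runs on this node (script is installed to /usr/local/bin by default)
/usr/local/bin/rke2-killall.sh

# 2. Mount the new volume somewhere temporary and copy the datastore over
mkdir -p /mnt/newdb
mount /dev/nvme1n1 /mnt/newdb
rsync -aHAX /var/lib/rancher/rke2/server/db/ /mnt/newdb/

# 3. Move the old copy aside and remount the volume at the real path
#    (add it to /etc/fstab as well so it survives reboots)
umount /mnt/newdb
mv /var/lib/rancher/rke2/server/db /var/lib/rancher/rke2/server/db.old
mkdir /var/lib/rancher/rke2/server/db
mount /dev/nvme1n1 /var/lib/rancher/rke2/server/db

# 4. Bring RKE2 back and check cluster/etcd health before cleaning up db.old
systemctl start rke2-server
kubectl get nodes
```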
Ok. Thank you very much for helping me! You are a good person.
👍 1