# rke2
b
RKE2 disk IOPS / latency - we have a 3 server + 6 agent node cluster with ~250 pods total running 1.31.2 on AWS. We are frequently experiencing cluster instability, with the server nodes locking up. When this happens, the server node volumes saturate their IOPS. The EBS gp3 volumes are now at 8000 IOPS and 300 MiB/s throughput but still saturate when the nodes lock up. We have a similarly sized cluster running the same OS and RKE2 version on-prem which never gets close to this level of disk IO. Has anyone seen this before? What IOPS do people run with in AWS?
When the cluster is stable, IOPS are much lower, but redeploys, pod migrations, etc. seem to trigger 30x spikes.
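A rough way to see those spikes from the EBS side is CloudWatch; the volume ID and time window below are placeholders, not our actual values:

```bash
# Sketch: pull per-volume write ops from CloudWatch for the window around a spike.
# vol-0123... and the timestamps are placeholders -- substitute the server's root/etcd volume and window.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeWriteOps \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2024-01-01T08:30:00Z \
  --end-time 2024-01-01T09:30:00Z \
  --period 60 --statistics Sum
```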
Might be a symptom rather than a cause...
etcd stats
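The same stats can be pulled on a server node with etcdctl, assuming a binary is available there (otherwise exec into the etcd static pod); the cert paths below are the RKE2 defaults and may differ on other installs:

```bash
# Sketch: query etcd member status (DB size, leader, raft term) on an RKE2 server node.
# Cert paths assume a default RKE2 install under /var/lib/rancher/rke2.
export ETCDCTL_API=3
CERTS=/var/lib/rancher/rke2/server/tls/etcd
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=$CERTS/server-ca.crt \
  --cert=$CERTS/server-client.crt \
  --key=$CERTS/server-client.key \
  endpoint status --write-out=table
# Alternative: kubectl -n kube-system exec etcd-<nodename> -- etcdctl ... (same flags)
```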
c
I’d probably try to figure out what’s thrashing the disks. Is it etcd? Is it image pull/prune? Is it your workload?
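If you want that attribution captured continuously, something like the following works (assumes iotop and sysstat are installed; just a sketch):

```bash
# Sketch: log per-process and per-device disk IO so the culprit is visible after a lockup.
iotop -obtqqq -d 5 >> /var/log/iotop.log &       # -o only active processes, -b batch, -t timestamps
pidstat -d 5 >> /var/log/pidstat-disk.log &      # per-process kB read/written per second
iostat -xz 5 >> /var/log/iostat.log &            # per-device utilization, queue depth, await
```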
b
Our latest theory is that it's our workload somehow causing a huge spike in etcd writes by thrashing the API server. Unfortunately, when the server nodes go down we lose all access and detailed metrics stop reporting. The control plane AWS NLB is being swarmed by something. We will go through audit and any other logging we can find to see what is causing these spikes. This post was probably more suited to the k8s Slack, but if anyone has any general tips on what to look for and where, they would be appreciated!
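For the audit trawl, a rough way to rank API clients is below; it assumes audit logging is enabled and that the log path (the usual RKE2 location) matches your setup, which it may not:

```bash
# Sketch: count completed audit events by user agent / verb / resource
# to spot whatever is hammering the apiserver. Path assumed from RKE2 defaults.
AUDIT_LOG=/var/lib/rancher/rke2/server/logs/audit.log
jq -r 'select(.stage=="ResponseComplete")
       | [(.userAgent // "-"), .verb, (.objectRef.resource // "-")] | @tsv' "$AUDIT_LOG" \
  | sort | uniq -c | sort -rn | head -20
```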
The control plane nodes have the `CriticalAddonsOnly=true:NoExecute` taint, so all disk IO should come indirectly from our workloads.
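A quick way to confirm that on all three servers (the config path assumes RKE2 defaults):

```bash
# Sketch: confirm the taint is applied on every server node and where it is configured.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
# The taint usually comes from the node-taint entry in the RKE2 server config:
grep -A2 'node-taint' /etc/rancher/rke2/config.yaml
```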
I'll keep this as a running log: we shut down all the agents, kept the servers on, waited 5 minutes, and turned all the agents back on again. 2/3 servers were fine; 1 went down for 30 minutes but recovered without a restart. iotop and iostat were running in the background on all servers, but on the one that failed, all logs stopped (including /var/log/messages) while it was unresponsive. Given 2/3 servers were fine, fundamentally the setup must be OK and there is an error condition somewhere (not necessarily RKE2) that causes a cascading failure. In the screenshot the left server went down; the right one is fine and didn't come close to the disk IO of server1.
iotop from around the problematic time gives multiple readings for processes, which suggests they are being restarted:
```
08:55:39 Actual DISK READ:     127.63 M/s | Actual DISK WRITE:     654.91 K/s
08:55:39    2085 be/4 etcd        2.22 M/s  115.27 K/s ?unavailable?  etcd --config-file=/var/lib/
08:55:39  244376 be/4 root        2.11 M/s  135.36 B/s ?unavailable?  containerd -c /var/lib/ranch
08:55:39    3354 be/4 root        2.34 M/s    0.00 B/s ?unavailable?  cilium-agent --config-dir=/t
08:55:39    3412 be/4 root        3.45 M/s    0.00 B/s ?unavailable?  cilium-agent --config-dir=/t
08:55:39    3749 be/4 root        2.30 M/s    0.00 B/s ?unavailable?  cilium-agent --config-dir=/t
08:55:39    3766 be/4 root        3.27 M/s    0.00 B/s ?unavailable?  cilium-agent --config-dir=/t
08:55:39    3865 be/4 root        2.71 M/s    0.00 B/s ?unavailable?  coredns -conf /etc/coredns/C
08:55:39  244118 be/4 root        2.62 M/s    0.00 B/s ?unavailable?  kube-scheduler --permit-port
08:55:39  244160 be/4 root        2.22 M/s    0.00 B/s ?unavailable?  rke2 server
08:55:39  244162 be/4 root        3.05 M/s    0.00 B/s ?unavailable?  rke2 server
08:55:39  244349 be/4 root        4.48 M/s    0.00 B/s ?unavailable?  kubelet --volume-plugin-dir=
08:55:39  244356 be/4 root        3.39 M/s    0.00 B/s ?unavailable?  kubelet --volume-plugin-dir=
08:55:39  244361 be/4 root        4.44 M/s    0.00 B/s ?unavailable?  kubelet --volume-plugin-dir=
08:55:39  244762 be/4 root        3.55 M/s    0.00 B/s ?unavailable?  kube-controller-manager --fl
08:55:39  245001 be/4 root        4.89 M/s    0.00 B/s ?unavailable?  kube-controller-manager --fl
08:55:39  245160 be/4 root        3.80 M/s    0.00 B/s ?unavailable?  kube-controller-manager --fl
08:55:39  245151 be/4 root        3.14 M/s    0.00 B/s ?unavailable?  kube-apiserver --admission-c
08:55:39  245158 be/4 root        3.96 M/s    0.00 B/s ?unavailable?  kube-apiserver --admission-c
08:55:39  245159 be/4 root        3.92 M/s    0.00 B/s ?unavailable?  kube-apiserver --admission-c
```
Found some OOM kills in the system messages: kube-apiserver got to 1.8GB, rke2 was at 707MB, etcd at 300MB. Happy-path usage is ~500MB for kube-apiserver and ~350MB for rke2. The servers are 2 CPU / 4GB RAM, based on https://docs.rke2.io/install/requirements#vm-sizing-guide, so it looks likely the node was being overwhelmed and stuck in a reboot loop. Unsure why the other 2 servers are fine.
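For reference, roughly how we spotted them (standard kernel log greps, nothing RKE2-specific):

```bash
# Sketch: find OOM kills and see which process the kernel picked.
journalctl -k --since "2 hours ago" | grep -iE 'out of memory|oom-kill'
# or, without journald:
dmesg -T | grep -iE 'out of memory|oom-kill'
```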
These all seem to be symptoms of a large etcd. When we manually ran a defrag/compact it went from 500MB to 40MB, but it has already grown back to 140MB.
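For the record, the manual compact/defrag was roughly this (cert paths assume the RKE2 defaults, same as above; defrag blocks IO on the member while it runs, so one member at a time):

```bash
# Sketch: manual compaction then defrag on one member. Paths assume a default RKE2 install.
CERTS=/var/lib/rancher/rke2/server/tls/etcd
E() { etcdctl --endpoints=https://127.0.0.1:2379 \
        --cacert=$CERTS/server-ca.crt --cert=$CERTS/server-client.crt \
        --key=$CERTS/server-client.key "$@"; }
REV=$(E endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
E compaction "$REV"                    # drop revision history older than the current revision
E defrag                               # defragment this member; blocks IO while it runs
E endpoint status --write-out=table    # confirm DB size dropped
```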
c
These are all very low resource levels. 2 core / 4GB is like the lowest possible amount that will even schedule all the server pods, and a couple hundred MB in etcd is nothing. It sounds like you just need to increase resources from the bare minimum.
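A quick check of where you actually sit (assumes the bundled metrics-server is running):

```bash
# Sketch: current usage on the servers and the heaviest control-plane pods.
kubectl top nodes
kubectl top pods -n kube-system --sort-by=memory | head -15
```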
b
Sounds like we took the VM sizing guide too much at face value. Will increase the sizes. Would you recommend setting up auto etcd defrag at all, or is it generally not needed with the auto compaction?
c
We don’t recommend doing it while it’s running, since it pauses all IO operations. It is auto-defragged whenever you start the rke2-server service. While it’s running I would just allow it to be whatever size it is; a couple hundred MB of pages that will eventually get reused shouldn’t be a lot to ask.
b
Thanks
m
Good read...