# rke2
a
anyone seen this "apply request took too long" warning from the etcd workload in a rancher RKE2 cluster? can i tune the expected-duration value?
c
this generally means your disk is too slow or you have insufficient CPU resources
and no, the solution is not to simply tell etcd to expect things to take longer
a
is there anywhere i can look in the grafana monitoring which will help me pinpoint what exactly is causing this
i've checked the CPU/mem requests on the etcd pods and it's well within the limits
c
check your disk io latency stats
make sure that you are not using rotational storage, and that etcd is not contending with your workload IO or image pulls
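the etcd grafana dashboard's disk panels are the place to look: etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds. as a rough sketch you could also alert on them, assuming the rancher-monitoring stack is scraping etcd (the rule name, namespace and thresholds here are illustrative):
```yaml
# Minimal sketch, not a drop-in rule: assumes the rancher-monitoring / kube-prometheus
# stack is installed and scraping etcd; names, namespace and thresholds are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-disk-latency
  namespace: cattle-monitoring-system
spec:
  groups:
    - name: etcd-disk
      rules:
        - alert: EtcdSlowWalFsync
          # 99th percentile WAL fsync latency; etcd guidance is to keep this under ~10ms
          expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
          for: 10m
          labels:
            severity: warning
        - alert: EtcdSlowBackendCommit
          # 99th percentile backend commit latency; should stay well below ~25ms
          expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.025
          for: 10m
          labels:
            severity: warning
```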
❤️ 1
a
thanks i will investigate further
i believe it's SSDs but i'm running RKE2 on VMs
the storage layer is a ceph cluster
c
etcd issues an fsync per write to force all outstanding io to be written to disk. if you have your images on the same disk, or workloads writing files to the same disk, that fsync will also need to flush everything else at the same time.
you are running etcd on top of a vm that is backed by ceph rbd?
a
rke2 is running on VMs
ceph is the storage layer for the VMs and ceph-csi is the storage class i'm using
image.png (attached screenshot of the observed storage latency)
c
That seems likely to be adding a fair bit of latency. I would probably not run the VMs on replicated storage. etcd handles replication itself, and the nodes should be cattle not pets so you just replace them rather than migrate them around.
oof yeah that is a lot of latency
a
stupid question
but right now i have 3 nodes (combined manager/minion setup)
if i moved the managers to new dedicated metal servers with dedicated storage
and left the minions as is
would etcd just run on the managers?
c
I don’t know what you mean by manager or minion
RKE2 has servers and agents
a
sorry i'm not using the RKE terminology
c
servers run control-plane components and etcd by default, but you can disable that if you want dedicated control-plane or etcd nodes.
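roughly, the split in /etc/rancher/rke2/config.yaml looks something like this (a sketch only, double-check the option names against the server config reference for your version):
```yaml
# Sketch of /etc/rancher/rke2/config.yaml for dedicated server roles; option names
# follow the RKE2 server CLI flags, verify them for your release.

# etcd-only server node:
disable-apiserver: true
disable-controller-manager: true
disable-scheduler: true

# a control-plane-only server node (no local etcd) would instead use:
# disable-etcd: true
```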
a
perfect
c
a
yeah just reading that
so i can set up 3 new nodes and move etcd there
c
by default servers will also schedule workload pods; if you want to prevent that you'd need to add a taint to those nodes.
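something like this in the server config is the usual way to do it (sketch; CriticalAddonsOnly is just the conventional key, any taint your workloads don't tolerate works):
```yaml
# Sketch: /etc/rancher/rke2/config.yaml on servers that should stay workload-free.
# The taint key and effect are the conventional choice, not mandatory.
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"
```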
a
yep that is no issue
the biggest problem is finding the new hosts / storage haha
c
if you are going to split the server roles:
for etcd you really want 3 or 5 nodes. for the control-plane there's not much point in having more than 2 because most things only have a single active controller at a time.
a
as far as migrating the current setup
i'm assuming i can just join the new nodes to the cluster
increase the size of the etcd cluster
then update the configs on the legacy etcd nodes to remove that role
(with the following in the config)
```yaml
disable-etcd: true
```
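i'm picturing something like this on the new nodes (the registration address and token are placeholders):
```yaml
# Sketch of /etc/rancher/rke2/config.yaml on a new dedicated etcd node joining the
# existing cluster; the registration address and token below are placeholders.
server: https://<existing-server>:9345
token: <cluster-token>
# plus the disable-apiserver / disable-controller-manager / disable-scheduler
# options if these nodes should run etcd only
```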
c
yeah that should work
a
awesome, thanks for your help
h
This is an old doc and the GH links need to be updated, but personally I really like this SUSE KB article about etcd performance testing: https://www.suse.com/support/kb/doc/?id=000020100
❤️ 1
💯 1
a
Just wanted to circle back and say that after moving etcd to local storage instead of Ceph-backed storage, the issue is resolved. Thanks again to all who helped here