# rke2
a
anyone seen this "apply request took too long" warning from the etcd workload in a rancher RKE2 cluster? can i tune the expected-duration value?
c
this generally means your disk is too slow or you have insufficient CPU resources
and no, the solution is not to simply tell etcd to expect things to take longer
a
is there anywhere i can look in the grafana monitoring which will help me pinpoint what exactly is causing this
i've checked the CPU/mem requests on the etcd pods and it's well within the limits
c
check your disk io latency stats
make sure that you are not using rotational storage, and that etcd is not contending with your workload IO or image pulls
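the etcd grafana dashboard's disk panels are the place to look: etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds. as a rough sketch you could also alert on them, assuming the rancher-monitoring stack is scraping etcd (the rule name, namespace and thresholds here are illustrative):
```yaml
# Minimal sketch, not a drop-in rule: assumes the rancher-monitoring / kube-prometheus
# stack is installed and scraping etcd; names, namespace and thresholds are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-disk-latency
  namespace: cattle-monitoring-system
spec:
  groups:
    - name: etcd-disk
      rules:
        - alert: EtcdSlowWalFsync
          # 99th percentile WAL fsync latency; etcd guidance is to keep this under ~10ms
          expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
          for: 10m
          labels:
            severity: warning
        - alert: EtcdSlowBackendCommit
          # 99th percentile backend commit latency; should stay well below ~25ms
          expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.025
          for: 10m
          labels:
            severity: warning
```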
❤️ 1
a
thanks i will investigate further
i believe it's SSDs but i'm running RKE2 on VMs
the storage layer is a ceph cluster
c
etcd issues an fsync per write to force all outstanding io to be written to disk. if you have your images on the same disk, or workloads writing files to the same disk, that fsync will also need to flush everything else at the same time.
you are running etcd on top of a vm that is backed by ceph rbd?
a
rke2 is running on VMs
ceph is the storage layer for the VMs and ceph-csi is the storage class i'm using
image.png (attached screenshot of the observed storage latency)
c
That seems likely to be adding a fair bit of latency. I would probably not run the VMs on replicated storage. etcd handles replication itself, and the nodes should be cattle not pets so you just replace them rather than migrate them around.
oof yeah that is a lot of latency
a
stupid question
but right now i have 3 nodes (combined manager/minion setup)
if i moved the managers to new dedicated metal servers with dedicated storage
and left the minions as is
would etcd just run on the managers?
c
I don’t know what you mean by manager or minion
RKE2 has servers and agents
a
sorry i'm not using the RKE terminology
c
servers run control-plane components and etcd by default, but you can disable that if you want dedicated control-plane or etcd nodes.
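roughly, the split in /etc/rancher/rke2/config.yaml looks something like this (a sketch only, double-check the option names against the server config reference for your version):
```yaml
# Sketch of /etc/rancher/rke2/config.yaml for dedicated server roles; option names
# follow the RKE2 server CLI flags, verify them for your release.

# etcd-only server node:
disable-apiserver: true
disable-controller-manager: true
disable-scheduler: true

# a control-plane-only server node (no local etcd) would instead use:
# disable-etcd: true
```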
a
perfect
c
a
yeah just reading that
so i can set up 3 new nodes and move etcd there
c
by default servers will also schedule workload pods; if you want to prevent that you'd need to add a taint to those nodes.
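something like this in the server config is the usual way to do it (sketch; CriticalAddonsOnly is just the conventional key, any taint your workloads don't tolerate works):
```yaml
# Sketch: /etc/rancher/rke2/config.yaml on servers that should stay workload-free.
# The taint key and effect are the conventional choice, not mandatory.
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"
```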
a
yep that is no issue
the biggest problem is finding the new hosts / storage haha
c
if you are going to split the server roles:
for etcd you really want 3 or 5 nodes. for the control-plane there's not much point in having more than 2 because most things only have a single active controller at a time.
a
as far as migrating the current setup
i'm assuming i can just join the new nodes to the cluster
increase the size of the etcd cluster
then update the configs on the legacy etcd nodes to remove that role
(with the following in the config)
```yaml
disable-etcd: true
```
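i'm picturing something like this on the new nodes (the registration address and token are placeholders):
```yaml
# Sketch of /etc/rancher/rke2/config.yaml on a new dedicated etcd node joining the
# existing cluster; the registration address and token below are placeholders.
server: https://<existing-server>:9345
token: <cluster-token>
# plus the disable-apiserver / disable-controller-manager / disable-scheduler
# options if these nodes should run etcd only
```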
c
yeah that should work
a
awesome, thanks for your help
h
This is an old doc and the GH links need to be updated, but personally I really like this SUSE KB article about etcd performance testing: https://www.suse.com/support/kb/doc/?id=000020100
❤️ 1
💯 1
a
Just wanted to circle back and say that after moving etcd to local storage instead of Ceph-backed storage, the issue is resolved. Thanks again to all who helped here