I appear to be having some etcd issues, don't know...
# harvester
p
I appear to be having some etcd issues, don't know if someone can give a quick hand.
Two days ago, my cluster "stopped" for 11 minutes; in Grafana, all 3 nodes stopped posting updates.
[image.png: Grafana graph showing all 3 nodes missing metrics for the 11-minute window]
This caused some of my VMs to move or restart, and the rke2 leader was changed.
For 6 of those minutes, the logs on H1 are:
Copy code
541121Z","logger":"etcd-client","caller":"v3@v3.5.13-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc0009e41e0/1>>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded
Or the same thing with a different message.
Copy code
desc = context deadline exceeded"
The second server, H2 (which I believe was the leader), showed the same logs, as well as:
Copy code
Sep 16 23:38:16 harvester-2 rke2[20131]: time="2025-09-16T23:35:00Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Harvester 3, though, had no interesting logs at that point in time, despite also having no data entries for that time slot. I guess that's because the metrics server wasn't working with 2 of the 3 nodes knocked out.
HOWEVER, it somehow gets worse.
About 24 hours later, Harvester 3 goes out; H1 and H2 stay fine this time. H3 was the rke2 leader at this point (it had taken over from H2 after the previous incident above). This case has the following logs:
Copy code
917081Z","logger":"etcd-client","caller":"v3@v3.5.13-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc003a9a000/1>>
g="Failed to check local etcd status for learner management: context deadline exceeded"
="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
="Stopped tunnel to 10.0.1.63:9345"
="Proxy done" err="context canceled" url="<wss://10.0.1.63:9345/v1-rke2/connect>"
="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
g="leaderelection lost for rke2-etcd"
ode=exited, status=1/FAILURE
So now I'm wondering where the issue could stem from: is it the database size (1.6GB), disk slowness (they are 10k SAS HDDs in RAID1), or network issues?
I also get a fair few of the following error messages, though they don't seem to cause or be caused by my etcd issues.
Copy code
msg="Proxy error: write failed: write tcp 127.0.0.1:9345->127.0.0.1:35390: write: connection reset by peer"
They do appear fairly often in the logs though 😕
Any tips to prevent my cluster from imploding yet again would be cool 🙏
Oh, my etcd db was 1.7GB. I ran a defrag across all my nodes and brought it down to 50MB. I am a little bit freaked out over the reduction in size. Will see if this happens again.
b
disk slowness (they are 10k SAS HDDs in RAID1)
Highly likely it's the disk.
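For reference, one way to confirm that is etcd's usual disk test: the 99th percentile of fdatasync latency on the etcd disk should stay under roughly 10ms, and fio can measure it directly. A minimal sketch, assuming fio is installed on the node; the directory below is the default RKE2 server DB path, but any directory on the same disk works:
Copy code
# Benchmark small fdatasync'd writes, roughly the pattern of etcd's WAL:
# ~22MB written in 2300-byte blocks, with an fdatasync after every write.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/rancher/rke2/server/db \
    --size=22m --bs=2300 --name=etcd-disk-test
# Remove the test file fio leaves behind.
rm -f /var/lib/rancher/rke2/server/db/etcd-disk-test.0.0
Check the fsync/fdatasync percentiles in the output; SSD/NVMe usually sits in the low single-digit milliseconds, while busy spinners are often well above 10ms.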
p
Alright, thanks a ton. I'll bump that up my priority chain for upgrades. At least the defrag brought down the size dramatically, so I think that should help for a bit.
b
Yeah, you can run that fairly often with no downsides.
Without it, etcd is basically just leaving stale data on the disk, and with spinners that'll affect the response times by quite a bit.
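To see how much of the DB file is actually live before (or after) a defrag, etcdctl can report both numbers. A minimal sketch, assuming etcd 3.5's JSON field names (dbSize is the on-disk file size, dbSizeInUse the live portion) and the standard RKE2 client cert paths:
Copy code
#!/bin/bash
# Print on-disk vs in-use DB size for every member; a large gap between
# dbSize and dbSizeInUse is the stale space a defrag would reclaim.
etcdnode=$(kubectl -n kube-system get pod -l component=etcd --no-headers \
  -o custom-columns=NAME:.metadata.name | head -1)

kubectl -n kube-system exec ${etcdnode} -- etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint status --cluster -w json \
  | grep -oE '"dbSize(InUse)?":[0-9]+'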
p
Okay okay. Actually, I think the etcd DB had been inflated because of a backup issue, which is being fixed in 1.5.2.
Ahhhh, that makes sense
b
Just don't be surprised if it jumps up to something like 500MB by Monday.
p
Oh yeah, that's fine. I might set a cron job to run it semi-frequently
Thank you for your advice and opinion on the root cause!!
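If it does end up as a cron job, a workstation-side crontab entry is probably the simplest form. A hypothetical sketch; the script path, kubeconfig location, and schedule are all placeholders:
Copy code
# crontab -e on a machine with cluster access: run a local etcd defrag
# script at 03:00 every Sunday, logging output to the home directory.
0 3 * * 0 KUBECONFIG=$HOME/.kube/harvester.yaml $HOME/bin/rke2-etcd-defrag.sh >> $HOME/etcd-defrag.log 2>&1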
b
When you switch to SSD/NVMe, just be sure to turn off the cron, as it'll wear out your disk.
p
When I swap disks, I'll have to reinstall Harvester on that node anyways
But yep, will keep that in mind
b
I was thinking you were triggering it via Ansible or something.
Let me pull my script I made for this... hang on.
Copy code
#!/bin/bash
# Defrags the etcd cluster behind RKE2/Harvester by exec'ing etcdctl inside
# one of the etcd static pods: status before, defrag, then health and status after.
# Note: drop the -it flags if you run this non-interactively (e.g. from cron).

# Pick the first etcd pod in kube-system to run the commands through.
etcdnode=$(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name | head -1)

echo "Getting etcd Status"

# Per-member status: DB size, leader, raft index.
kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint status --cluster -w table

echo "Defragging the etcd in the current cluster via ${etcdnode}"

# Defragment every member; this rewrites the DB files and reclaims stale space.
kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt defrag --cluster

echo "Getting etcd Health"

# Health probe per member; the TOOK column is the response latency.
kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint health --cluster -w table

echo "Getting etcd Status"

# Status again to confirm the DB size dropped after the defrag.
kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint status --cluster -w table
The health check should show the latency (the TOOK column) to be less than 15ms. When we were running on spinners it was more like 200ms+.
😅 1
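If that check does get scheduled, the TOOK guidance can be turned into a simple alert. A rough sketch, assuming the JSON form of "endpoint health" reports took as a millisecond string (e.g. "7.3ms"); other units or field names would need adjusting:
Copy code
#!/bin/bash
# Warn (and exit non-zero) if any etcd member's health probe exceeds LIMIT_MS.
LIMIT_MS=15

etcdnode=$(kubectl -n kube-system get pod -l component=etcd --no-headers \
  -o custom-columns=NAME:.metadata.name | head -1)

json=$(kubectl -n kube-system exec ${etcdnode} -- etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint health --cluster -w json)

slow=0
for took in $(echo "${json}" | grep -o '"took":"[^"]*"' | cut -d'"' -f4); do
  ms=${took%ms}   # strip the unit; assumes millisecond values
  if awk -v v="$ms" -v lim="$LIMIT_MS" 'BEGIN { exit !(v+0 > lim) }'; then
    echo "WARN: etcd health probe took ${took} (limit ${LIMIT_MS}ms)"
    slow=1
  fi
done
exit $slow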
p
Why Ansible when a little bash does the job XD
Thank you for the script, I'll give the cluster a thorough check tomorrow!
b
kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint health --cluster -w table
is just the health part
โค๏ธ 1
p
Ohhhhhhhh the whole Harvester cluster in Ansible. I didn't think that far
b
But also Ansible, because the OS is immutable.
You don't want to lose the cron because the node rebooted.
p
OHHHHHHHH
Yeah, that's true. Thank you, I had forgotten that
b
There are ways around that, but they're all painful.
Much easier for AWX or something external to run things like that.
p
The over-engineering part of me thinks of a DaemonSet with privileges and so on and so on.
b
Yeah, you could do a cron job
p
I'll do just bash for now ahaha Thank you though!
b
but then there's security for getting the kube config...
Yeah just running it once in a while from your laptop should be fine
😂 1
p
Who doesn't like RBAC here XD
s
Thanks for that! I just ran it on my cluster and it went from 500MB to 59MB.