# harvester
p
Hello! I discovered a number of issues with my cluster this morning. H1 (I have 3 nodes, H1-3) is NotReady - rke2-server is not running because of a failed etcd defragmentation. While that is something I will have to investigate and fix, my priority is ensuring the VMs are running. The load on H3 has increased as a result of VMs moving over. However, a number of VMs are stuck in the "Scheduling" phase and are not ready. I assume these VMs were intended to move to H2. When I checked why the migrations are not progressing, I noticed H2 has the label kubevirt.io/schedulable=false. I tried changing it to true, but it returns immediately to false.
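(For the record, checking and re-applying that label from the CLI looks roughly like this - a sketch of what I tried, not a transcript:)
Copy code
kubectl get node harvester-2 -L kubevirt.io/schedulable
kubectl label node harvester-2 kubevirt.io/schedulable=true --overwrite
# the label snaps straight back to false a moment later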
Checking on Longhorn (which is undoubtedly having a crisis right now), I see a few volumes which are attached to virt-launcher pods on both H2 and H3. When I open them, I get:
failed to list snapshot: cannot get client for volume pvc-eb34daca-fa4c-407f-8bbf-3dccf503e005 on node harvester-2: more than one engine found for volume pvc-eb34daca-fa4c-407f-8bbf-3dccf503e005
And indeed, I see 4 "replicas":
H1 replica - red (stopped)
H2 replica - blue (running)
H3 replica - gray (stopped)
H2 replica - gray (running)
I don't think the storage scheduling is an issue - H2 has some 3TB which it can allocate...
Oh and this is what I get from the rke2-server status on H1
Copy code
"Failed to test data store connection: failed to defragment etcd database: context deadline exceeded"
"Defragmenting etcd database"
At this point, I would probably have tried restarting the rke2-server process, just to see if that may help. However, the VMs which are "Pending" are still running on H1 and I do not want to risk them being stopped and not starting back up again.
This is what happened to H1 around the time of the issue. I don't know which part of this could be attributed to the etcd defragmentation issue - though I noticed the "user" CPU time dropped from around 30% to 10%
And while, undoubtedly, the etcd issue is a problem on H1, H2 has its own set of problems. For example, why is the kubevirt.io/schedulable label set to false? The plot thickens. You can visibly see the point at which the VMs probably got migrated off, with that sudden massive drop in memory use. H2 is also receiving a lot of data on the mgmt-br and bo lines - traffic coming from H3.
But looking at H3, I can tell that this node does not care in the slightest about the strokes which the other two nodes had
So far, here is what I can say happened: at 6am, H2 hit something which caused it to become unschedulable for VMs. However, this does not affect normal pods, which keep running normally. And around 5:45(?), H1 entered its etcd defragmentation loop. Whether H1 caused H2 to become unschedulable is, for now, unknown to me. It would be ideal to know what is making H2 unschedulable and why.
Ah, on the latter point, I forgot I can just describe a node.
Copy code
Events:
Normal NodeUnresponsive 57m (x2 over 60m) node-controller virt-handler is not responsive, marking node as unresponsive
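(That event came straight out of a plain describe, for anyone searching later:)
Copy code
kubectl describe node harvester-2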
The virt-handler pod was "running" but had the following in the logs:
Copy code
"component":"virt-handler"
"level":"error"
"msg":"Unable to mark vmi as unresponsive socket //pods/*/volumes/kubernetes.io~empty-dir/sockets/launcher-sock"
"pos":"cache.go:486"
"reason":"open /pods/*/volumes/kubernetes.io~empty-dir/sockets/launcher-unresponsive: no such file or directory"
So I restarted that pod, and now it's back to working normally. harvester-2 is back to schedulable 🎉
🎉 2
(this thread is hopefully going to be useful to people in the future if they ever have the misfortune of waking up to a messed up cluster like this)
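(In that spirit: the "restart" was just deleting the virt-handler pod on H2 and letting its DaemonSet recreate it - roughly the below, assuming the harvester-system namespace on Harvester; the pod name is a placeholder:)
Copy code
# find the virt-handler pod running on harvester-2, then delete it
kubectl -n harvester-system get pods -l kubevirt.io=virt-handler -o wide
kubectl -n harvester-system delete pod <virt-handler-pod-on-harvester-2>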
Anyway, more news. While H2 is back to schedulable, my VMs are not moving.
In some cases, I see "no engine running" on H2 and in other cases "2 engines running" on H2. Weird. (This is in Longhorn, by the way.)
Copy code
kubectl get engines -n longhorn-system
I can use the above to get the engines and then just use grep to filter for the PVC I'm experimenting on. In this case, there are indeed 2 engines, one on H2 and one on H3. Was that maybe caused by the attempted migration of VMs off of H2?
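(Concretely, something like this, using the PVC from the earlier error as the example:)
Copy code
kubectl get engines -n longhorn-system | grep pvc-eb34daca-fa4c-407f-8bbf-3dccf503e005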
That may actually be the case: if I get the volume in the longhorn-system namespace and output it as YAML, I see:
currentMigrationNodeID: harvester-3
currentNodeID: harvester-2
So probably that engine on H3 was caused by the attempted migration. Why the migration didn't go through is another thing, but I'm not too concerned as H3 hit some 75% RAM usage with the refugee VMs, so the reserved RAM is probably at its limit. I am very grateful to H3 for welcoming the outcast VMs 😊
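(For reference, pulling those two fields out of the volume CR looks something like this - same example PVC as above:)
Copy code
kubectl -n longhorn-system get volumes.longhorn.io pvc-eb34daca-fa4c-407f-8bbf-3dccf503e005 -o yaml \
  | grep -E 'currentNodeID|currentMigrationNodeID'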
b
If you do a
kubectl get -A VirtualMachineInstanceMigration
do you have a lot of in progress ones?
p
Nope, none in progress. So I guess those which could migrate did migrate and the others failed (or just disappeared). I was going through this: https://longhorn.io/kb/troubleshooting-two-active-engines/#workaround and so I assume I should try deleting the engine for H3 (as the current node is H2).
In fact, the engine on H3 dates back to exactly 6am, when the RAM usage dropped on H2. So it most definitely stems from the VM exodus from H2.
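(So the plan, per my reading of that KB page and the currentNodeID above, would be to delete the engine sitting on H3 - the engine name below is just a placeholder:)
Copy code
kubectl -n longhorn-system get engines | grep pvc-eb34daca-fa4c-407f-8bbf-3dccf503e005
kubectl -n longhorn-system delete engine <engine-on-harvester-3>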
b
If the VMs are running, might be worth trying to get h1 back. I'm sure a migration from 2 or 3 -> 1 would clear up the engine problem
What disks do your nodes have (ssd, nvme, hdd)?
p
Alright. This is my concern. I have a VM which HR uses and I would rather not get yelled at this close to Friday. My biggest worry - if I restart or affect the rke2-server service, will the VMs, which have been orphaned on H1, be restarted or stopped? I'm worried they get stopped and don't come back up.
Sadly, the nodes are on HDDs. SSD storage for servers is extremely expensive in this part of the world 😞
b
oh sorry I didn't realise VMs were still lingering on h1 now that h2 was back.
p
Yeah, bizarrely
Copy code
k get vmi -n sicorax 
NAME     AGE    PHASE       IP   NODENAME  READY 
sicorax  5h39m  Scheduling                   False
It says Scheduling but it's still running on H1. It looks like it's trying to go to H3? I got this from the pod, which is in Init:0/3 state:
Copy code
Warning  FailedAttachVolume  4m27s (x124 over 5h31m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-5bd3aeca-91da-4426-8a31-d10268cb195e" : rpc error: code = DeadlineExceeded desc = volume pvc-5bd3aeca-91da-4426-8a31-d10268cb195e failed to attach to node harvester-3 with attachmentID csi-615a1e5d177c160a054719e3e4f62aa580373ec48245ce9cb89c403519d2ffae
I guess actually, that since H1's rke2-server service isn't running, it didn't even get the signal to stop the VM for it to be moved to H3.
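(Side note: one way to see where Kubernetes currently thinks that volume should be attached is the VolumeAttachment objects - a sketch:)
Copy code
kubectl get volumeattachments | grep pvc-5bd3aeca-91da-4426-8a31-d10268cb195e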
I would love to try and restart rke2-server on H1, that might "clear things up" for the etcd defrag loop. But again, I want my VMs to stay running. Maybe I should do this after work hours 😂
b
the etcd defrag on startup has a 30s timeout and I can't see a flag to disable that behaviour. There might be issues with the HDD never being able to do it in 30s
not sure if it "does a bit" on each run and will eventually work or the whole operation cancels
is etcd still running on the node?
if so you might be able to defrag it outside of rke2/k3s and let it take as long as it needs
p
Which is weird, because it has been fine in the past. Maybe something suddenly changed when H2 went down, and H1 wasn't able to keep up.
I've never interacted with the etcd part directly. Give me a second, I'll find the docs
b
but hopefully you can tweak one of the commands to defrag and then rke2 will start
probably take something like 31 seconds....
p
Thank you! The etcd database is 745MB on H2 and H3. On H1, it's a whole 1.6GB 👀 I can see why it's timing out now ahaha
The docs have a lot about checking. I wonder, is there a way for me to run the etcd defrag manually from the CLI...
b
I think there is
etcdctl defrag
command
p
I thought etcdctl would be there, but it seems Harvester does not come with that tool?
Copy code
bash: etcdctl: command not found
b
you'll need to drop into the etcd pod if it's running
Copy code
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status --cluster --write-out=table
p
Crictl won't run on H1
Copy code
Error while dialing: dial unix /var/run/k3s/containerd/containerd.sock: connect: connection refused
And I can't exec into the pod from H2/3
Copy code
Error from server: error dialing backend: proxy error from 127.0.0.1:9345 while dialing 10.0.1.61:10250, code 502: 502 Bad Gateway
Since the etcd endpoint is supposedly accessible on <nodeip>:2379 I guess I could copy the certificates to my laptop, and run etcdctl from my laptop itself... ouch
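(Roughly what that would look like - a sketch reusing the cert paths from the crictl command above; the ssh user and H1's IP here are my assumptions:)
Copy code
scp rancher@10.0.1.61:/var/lib/rancher/rke2/server/tls/etcd/server-client.crt .
scp rancher@10.0.1.61:/var/lib/rancher/rke2/server/tls/etcd/server-client.key .
scp rancher@10.0.1.61:/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt .
etcdctl --endpoints=https://10.0.1.61:2379 \
  --cert server-client.crt --key server-client.key --cacert server-ca.crt \
  endpoint health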
Just one question though. Since the VMs are still running on H1, that means there is a LH replica on H1 (or the data is being sent to the other replicas on other nodes). Which means that when H1 comes back, the "new" data would be synced to the other nodes, right? At least, that's my logic for it
b
I think so?
p
I'm running etcdctl from my laptop. Just to get endpoint status/health, it's timing out.
Copy code
Failed to get the status of endpoint <https://10.0.1.61:2379> (context deadline exceeded)
b
on the node if you do
ps aux | grep etcd
can you see etcd running?
e.g. we have
Copy code
harvester-node-0-250204:~ # ps aux | grep etcd
root       6833 29.2  0.0 12287836 503936 ?     Ssl  Feb04 80372:05 etcd --config-file=/var/lib/rancher/rke2/server/db/etcd/config
p
At least it's there
Copy code
root      4051 16.9  0.6 12421172 1733948 ?    Ssl  Feb06 46030:31 etcd --config-file=/var/lib/rancher/rke2/server/db/etcd/config
I think the most miserable way of fixing this would be taking a backup of etcd on H2/3 and restoring it on H1
b
that's good, we can hopefully enter the pod namespaces
Copy code
nsenter --target ETCD_PID --mount  --net --pid --uts -- etcdctl \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status  --cluster --write-out=table;
does that work?
p
Sort of. It times out, but it works
Copy code
harvester-1:/home/rancher # nsenter --target 4051 --mount  --net --pid --uts -- etcdctl   --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt   --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key   --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status  --cluster --write-out=table;
{"level":"warn","ts":"2025-08-14T08:23:51.452266Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc0004c21e0/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded
Should be able to add --command-timeout=300s or something in there
b
I can't remember exactly how nsenter works and whether you need to run it every time or whether you are now in the namespaces and can just run the etcdctl command
maybe just try replacing
endpoint status  --cluster --write-out=table;
with
defrag;
and see what happens
p
Went very fast to
Copy code
Failed to defragment etcd member[127.0.0.1:2379] (context deadline exceeded)
So I should be able to just put a massive timeout and keep my fingers crossed
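(i.e. the full thing would look roughly like this - same PID and cert paths as above, with a generous timeout:)
Copy code
nsenter --target 4051 --mount --net --pid --uts -- etcdctl \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --command-timeout=30m defrag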
Though with rke2 also defrag-looping in the background, that might make it a bit harder. I'm not sure etcd would appreciate two defrag operations at the same time haha
b
true, you can probably do a
systemctl stop rke2-server
whilst you do this
p
If I have to do that, I'd rather run it past 5pm just so I don't risk turning off the important VM. I'm running the defrag with a 30min timeout and will have a quick lunch
🤞 1
I have been cursed. Even with an hour of timeout, it still did not go through. I might have to try an etcd restoration on H1, using a snapshot from H2/3. I do believe my data would be fine, as etcd won't affect the Longhorn replicas. Either way, I will have to try a proper restart of the rke2-server service out of hours.
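(For that restore, the plan is roughly the rke2 cluster-reset path - a sketch, with a placeholder for a snapshot copied over from H2/H3:)
Copy code
# on harvester-1, out of hours, with rke2-server stopped first
systemctl stop rke2-server
rke2 server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-copied-from-h2>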
Oh, also, memory has been allocated to 114% on H3, which, I think, explains why the other VMs are still pending - they were supposed to be moved to H3. I would have thought a VM restart would work; alas, deleting the pod does not. And the Harvester UI is down 😕
b
Oh dear. I think harvester overcommits RAM by 50% by default.
Good luck!
p
Oh, so that's why. Again, I don't have the UI to edit that overcommit, but actual usage is only around 70% right now, so no need to be alarmed yet. Looking forward to the etcd restore 😅
πŸ‘ 1
Thank you for all your help!!
b
you can do a
kubectl edit Settings overcommit-config
to edit via CLI if UI is down
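the value is a JSON blob - from memory the default is something like the below, where memory: 150 is that 150% RAM overcommit
Copy code
kubectl edit settings.harvesterhci.io overcommit-config
# default value (from memory, may vary by version): {"cpu":1600,"memory":150,"storage":200}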
p
Ah that's the way to do it. Thank you!
Sorry, me again. Just restored etcd on my first node from a snapshot. kubectl get nodes still shows it as NotReady, and etcd shows:
Copy code
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| <https://10.0.1.62:2379> | 5300d46da5ef6c19 |  3.5.13 |  1.2 GB |      true |      false |        23 |  744071023 |          744071023 |        |
| <https://10.0.1.61:2379> | bcf7f605752d301a |  3.5.13 |  1.3 GB |      true |      false |         3 |      16587 |              16587 |        |
| <https://10.0.1.63:2379> | 6a64f80a55285718 |  3.5.13 |  1.0 GB |     false |      false |        23 |  744071030 |          744071030 |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Is this normal/to be expected?
I assume it's reconciling and catching up with everything which happened in 20 hours.
Oooooh never mind, that does not seem right. Running kubectl get nodes on the recovered node (H1):
Copy code
NAME          STATUS     ROLES                       AGE    VERSION
harvester-1   Ready      control-plane,etcd,master   268d   v1.29.9+rke2r1
harvester-2   NotReady   control-plane,etcd,master   188d   v1.29.9+rke2r1
harvester-3   NotReady   control-plane,etcd,master   92d    v1.29.9+rke2r1
And if I run it on H2, it shows 2 and 3 as Ready but not 1.
Copy code
INFO[0081] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes
When running the restore, it said this at the end. I was expecting that it would rejoin the cluster with H2 and H3 and then update the etcd DB on H1?
At the end of the day (and what a day!) - Harvester synced all my volumes and recreated replicas of my LH volumes. All VMs got migrated to H2/H3. The H1 node is empty. I tried deleting etcd on H1 completely so that it might try synchronising with the etcds on H2/H3 rather than create a new cluster from the etcd snapshot. And, worst case, I'll just reinstall Harvester on H1 - treat the cluster as cattle, not pets 😄 It is pretty cool that Harvester got back up despite the catastrophic events on H1 and H2.
🎉 1