# harvester
p
Hello! I discovered a number of issues with my cluster this morning. H1 (I have 3 nodes, H1-3) is NotReady - rke2-server is not running because of a failed etcd defragmentation. While that is something I will have to investigate and fix, my priority is ensuring the VMs are running. The load on H3 has increased as a result of VMs moving over. However, a number of VMs are stuck in the "Scheduling" phase and are not ready. I assume these VMs were intended to move to H2. When I checked why the migrations are not progressing, I noticed H2 has the label kubevirt.io/schedulable=false. I tried changing it to true, but it returns immediately to false.
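(For the record, checking and re-applying that label from the CLI looks roughly like this - a sketch of what I tried, not a transcript:)
Copy code
kubectl get node harvester-2 -L kubevirt.io/schedulable
kubectl label node harvester-2 kubevirt.io/schedulable=true --overwrite
# the label snaps straight back to false a moment later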
Checking on Longhorn (which is undoubtedly having a crisis right now), I see a few volumes which are attached to virt-launcher pods on both H2 and H3. When I open them, I get:
failed to list snapshot: cannot get client for volume pvc-eb34daca-fa4c-407f-8bbf-3dccf503e005 on node harvester-2: more than one engine found for volume pvc-eb34daca-fa4c-407f-8bbf-3dccf503e005
And indeed, I see 4 "replicas":
H1 replica - red (stopped)
H2 replica - blue (running)
H3 replica - gray (stopped)
H2 replica - gray (running)
I don't think the storage scheduling is an issue - H2 has some 3TB which it can allocate...
Oh and this is what I get from the rke2-server status on H1
Copy code
"Failed to test data store connection: failed to defragment etcd database: context deadline exceeded"
"Defragmenting etcd database"
At this point, I would probably have tried restarting the rke2-server process, just to see if that may help. However, the VMs which are "Pending" are still running on H1 and I do not want to risk them being stopped and not starting back up again.
This is what happened to H1 around the time of the issue. I don't know which part of this could be attributed to the etcd defragmentation issue - though I noticed the "user" CPU time dropped from around 30% to 10%
And while, undoubtedly, the etcd issue is a problem on H1, H2 has its own set of problems. For example, why is the kubevirt.io/schedulable label set to false? The plot thickens. You can visibly see the point at which the VMs probably got migrated off, with that sudden massive drop in memory use. H2 is also receiving a lot of data on the mgmt-br and bo lines - traffic coming from H3.
But looking at H3, I can tell that this node does not care in the slightest about the strokes which the other two nodes had
So far, here is what I can say happened: at 6am, H2 hit something which caused it to become unschedulable for VMs. However, this does not affect normal pods, which keep running normally. And around 5:45(?), H1 entered its etcd defragmentation loop. Whether H1 caused H2 to become unschedulable is, for now, unknown to me. It would be ideal to know what is making H2 unschedulable and why.
Ah, on the latter point, I forgot I can just describe a node.
Copy code
Events:
Normal NodeUnresponsive 57m (x2 over 60m) node-controller virt-handler is not responsive, marking node as unresponsive
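(That event came straight out of a plain describe, for anyone searching later:)
Copy code
kubectl describe node harvester-2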
The virt-handler pod was "running" but had the following in the logs:
Copy code
"component":"virt-handler"
"level":"error"
"msg":"Unable to mark vmi as unresponsive socket //pods/*/volumes/kubernetes.io~empty-dir/sockets/launcher-sock"
"pos":"cache.go:486"
"reason":"open /pods/*/volumes/kubernetes.io~empty-dir/sockets/launcher-unresponsive: no such file or directory"
So I restarted that pod, and now it's back to working normally. harvester-2 is back to schedulable 🎉
🎉 2
(this thread is hopefully going to be useful to people in the future if they ever have the misfortune of waking up to a messed up cluster like this)
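(In that spirit: the "restart" was just deleting the virt-handler pod on H2 and letting its DaemonSet recreate it - roughly the below, assuming the harvester-system namespace on Harvester; the pod name is a placeholder:)
Copy code
# find the virt-handler pod running on harvester-2, then delete it
kubectl -n harvester-system get pods -l kubevirt.io=virt-handler -o wide
kubectl -n harvester-system delete pod <virt-handler-pod-on-harvester-2>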
Anyway, more news. While H2 is back to schedulable, my VMs are not moving.
In some cases, I see "no engine running" on H2 and in other cases "2 engines running" on H2. Weird. (This is in Longhorn, by the way.)
Copy code
kubectl get engines -n longhorn-system
I can use the above to get the engines and then just use grep to filter for the PVC I'm experimenting on. In this case, there are indeed 2 engines, one on H2 and one on H3. Was that maybe caused by the attempted migration of VMs off of H2?
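(Concretely, something like this, using the PVC from the earlier error as the example:)
Copy code
kubectl get engines -n longhorn-system | grep pvc-eb34daca-fa4c-407f-8bbf-3dccf503e005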
That may actually be the case: if I get the volume in the longhorn-system namespace and output it as YAML, I see:
currentMigrationNodeID: harvester-3
currentNodeID: harvester-2
So probably that engine on H3 was caused by the attempted migration. Why the migration didn't go through is another thing, but I'm not too concerned as H3 hit some 75% RAM usage with the refugee VMs, so the reserved RAM is probably at its limit. I am very grateful to H3 for welcoming the outcast VMs 😊
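(For reference, pulling those two fields out of the volume CR looks something like this - same example PVC as above:)
Copy code
kubectl -n longhorn-system get volumes.longhorn.io pvc-eb34daca-fa4c-407f-8bbf-3dccf503e005 -o yaml \
  | grep -E 'currentNodeID|currentMigrationNodeID'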
b
If you do a
kubectl get -A VirtualMachineInstanceMigration
do you have a lot of in progress ones?
p
Nope, none in progress. So I guess those which could migrate did migrate and the others failed (or just disappeared). I was going through this: https://longhorn.io/kb/troubleshooting-two-active-engines/#workaround and so I assume I should try deleting the engine for H3 (as the current node is H2).
In fact, the engine on H3 dates back to exactly 6am, when the RAM usage dropped on H2. So it most definitely stems from the VM exodus from H2.
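(So the plan, per my reading of that KB page and the currentNodeID above, would be to delete the engine sitting on H3 - the engine name below is just a placeholder:)
Copy code
kubectl -n longhorn-system get engines | grep pvc-eb34daca-fa4c-407f-8bbf-3dccf503e005
kubectl -n longhorn-system delete engine <engine-on-harvester-3>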
b
If the VMs are running, might be worth trying to get h1 back. I'm sure a migration from 2 or 3 -> 1 would clear up the engine problem
What disks do your nodes have (ssd, nvme, hdd)?
p
Alright. This is my concern. I have a VM which HR uses and I would rather not get yelled at this close to Friday. My biggest worry - if I restart or affect the rke2-server service, will the VMs, which have been orphaned on H1, be restarted or stopped? I'm worried they get stopped and don't come back up.
Sadly, the nodes are on HDDs. SSD storage for servers is extremely expensive in this part of the world 😞
b
oh sorry I didn't realise VMs were still lingering on h1 now that h2 was back.
p
Yeah, bizarrely
Copy code
k get vmi -n sicorax 
NAME     AGE    PHASE       IP   NODENAME  READY 
sicorax  5h39m  Scheduling                   False
It says Scheduling but it's still running on H1. It looks like it's trying to go to H3? I got this from the pod, which is in Init:0/3 state:
Copy code
Warning  FailedAttachVolume  4m27s (x124 over 5h31m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-5bd3aeca-91da-4426-8a31-d10268cb195e" : rpc error: code = DeadlineExceeded desc = volume pvc-5bd3aeca-91da-4426-8a31-d10268cb195e failed to attach to node harvester-3 with attachmentID csi-615a1e5d177c160a054719e3e4f62aa580373ec48245ce9cb89c403519d2ffae
I guess actually, that since H1's rke2-server service isn't running, it didn't even get the signal to stop the VM for it to be moved to H3.
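(Side note: one way to see where Kubernetes currently thinks that volume should be attached is the VolumeAttachment objects - a sketch:)
Copy code
kubectl get volumeattachments | grep pvc-5bd3aeca-91da-4426-8a31-d10268cb195e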
I would love to try and restart rke2-server on H1, that might "clear things up" for the etcd defrag loop. But again, I want my VMs to stay running. Maybe I should do this after work hours 😂
b
the etcd defrag on startup has a 30s timeout and I can't see a flag to disable that behaviour. There might be issues with the HDD never being able to do it in 30s
not sure if it "does a bit" on each run and will eventually work or the whole operation cancels
is etcd still running on the node?
if so you might be able to defrag it outside of rke2/k3s and let it take as long as it needs
p
Which is weird, because it has been fine in the past. Maybe something suddenly changed when H2 went down, and H1 wasn't able to keep up.
I've never interacted with the etcd part directly. Give me a second, I'll find the docs
b
but hopefully you can tweak one of the commands to defrag and then rke2 will start
probably take something like 31 seconds....
p
Thank you! The etcd database is 745MB on H2 and H3. On H1, it's a whole 1.6GB 👀 I can see why it's timing out now ahaha
The docs have a lot about checking. I wonder, is there a way for me to run the etcd defrag manually from the CLI...
b
I think there is
etcdctl defrag
command
p
I thought etcdctl would be there, but it seems Harvester does not come with that tool?
Copy code
bash: etcdctl: command not found
b
you'll need to drop into the etcd pod if it's running
Copy code
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status --cluster --write-out=table
p
Crictl won't run on H1
Copy code
Error while dialing: dial unix /var/run/k3s/containerd/containerd.sock: connect: connection refused
And I can't exec into the pod from H2/3
Copy code
Error from server: error dialing backend: proxy error from 127.0.0.1:9345 while dialing 10.0.1.61:10250, code 502: 502 Bad Gateway
Since the etcd endpoint is supposedly accessible on <nodeip>:2379 I guess I could copy the certificates to my laptop, and run etcdctl from my laptop itself... ouch
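(Roughly what that would look like - a sketch reusing the cert paths from the crictl command above; the ssh user and H1's IP here are my assumptions:)
Copy code
scp rancher@10.0.1.61:/var/lib/rancher/rke2/server/tls/etcd/server-client.crt .
scp rancher@10.0.1.61:/var/lib/rancher/rke2/server/tls/etcd/server-client.key .
scp rancher@10.0.1.61:/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt .
etcdctl --endpoints=https://10.0.1.61:2379 \
  --cert server-client.crt --key server-client.key --cacert server-ca.crt \
  endpoint health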
Just one question though. Since the VMs are still running on H1, that means there is a LH replica on H1 (or the data is being sent to the other replicas on other nodes). Which means that when H1 comes back, the "new" data would be synced to the other nodes, right? At least, that's my logic for it
b
I think so?
p
I'm running etcdctl from my laptop. Just to get endpoint status/health, it's timing out.
Copy code
Failed to get the status of endpoint <https://10.0.1.61:2379> (context deadline exceeded)
b
on the node if you do
ps aux | grep etcd
can you see etcd running?
e.g. we have
Copy code
harvester-node-0-250204:~ # ps aux | grep etcd
root       6833 29.2  0.0 12287836 503936 ?     Ssl  Feb04 80372:05 etcd --config-file=/var/lib/rancher/rke2/server/db/etcd/config
p
At least it's there
Copy code
root      4051 16.9  0.6 12421172 1733948 ?    Ssl  Feb06 46030:31 etcd --config-file=/var/lib/rancher/rke2/server/db/etcd/config
I think the most miserable way of fixing this would be taking a backup of etcd on H2/3 and restoring it on H1
b
that's good, we can hopefully enter the pod namespaces
Copy code
nsenter --target ETCD_PID --mount  --net --pid --uts -- etcdctl \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status  --cluster --write-out=table;
does that work?
p
Sort of. It times out, but it works
Copy code
harvester-1:/home/rancher # nsenter --target 4051 --mount  --net --pid --uts -- etcdctl   --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt   --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key   --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status  --cluster --write-out=table;
{"level":"warn","ts":"2025-08-14T08:23:51.452266Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc0004c21e0/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded
Should be able to add --command-timeout=300s or something in there
b
I can't remember exactly how nsenter works and whether you need to run it every time or whether you are now in the namespaces and can just run the etcdctl command
maybe just try replacing
endpoint status  --cluster --write-out=table;
with
defrag;
and see what happens
p
Went very fast to
Copy code
Failed to defragment etcd member[127.0.0.1:2379] (context deadline exceeded)
So I should be able to just put a massive timeout and keep my fingers crossed
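(i.e. the full thing would look roughly like this - same PID and cert paths as above, with a generous timeout:)
Copy code
nsenter --target 4051 --mount --net --pid --uts -- etcdctl \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --command-timeout=30m defrag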
Though with rke2 also defrag-looping in the background, that might make it a bit harder. I'm not sure etcd would appreciate two defrag operations at the same time haha
b
true, you can probably do a
systemctl stop rke2-server
whilst you do this
p
If I have to do that, I'd rather run it past 5pm just so I don't risk turning off the important VM. I'm running the defrag with a 30min timeout and will have a quick lunch
🤞 1
I have been cursed. Even with an hour of timeout, it still did not go through. I might have to try an etcd restoration on H1, using a snapshot from H2/3. I do believe my data would be fine, as etcd won't affect the Longhorn replicas. Either way, I will have to try a proper restart of the rke2-server service out of hours.
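(For that restore, the plan is roughly the rke2 cluster-reset path - a sketch, with a placeholder for a snapshot copied over from H2/H3:)
Copy code
# on harvester-1, out of hours, with rke2-server stopped first
systemctl stop rke2-server
rke2 server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-copied-from-h2>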
Oh, also, memory has been allocated to 114% on H3, which, I think, explains why the other VMs are still pending - they were supposed to be moved to H3. I would have thought a VM restart would work; alas, deleting the pod does not. And the Harvester UI is down 😕
b
Oh dear. I think harvester overcommits RAM by 50% by default.
Good luck!
p
Oh, so that's why. Again, I don't have the UI to edit that overcommit, but actual usage is only around 70% right now, so no need to be alarmed yet. Looking forward to the etcd restore 😅
πŸ‘ 1
Thank you for all your help!!
b
you can do a
kubectl edit Settings overcommit-config
to edit via CLI if UI is down
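the value is a JSON blob - from memory the default is something like the below, where memory: 150 is that 150% RAM overcommit
Copy code
kubectl edit settings.harvesterhci.io overcommit-config
# default value (from memory, may vary by version): {"cpu":1600,"memory":150,"storage":200}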
p
Ah that's the way to do it. Thank you!
Sorry, me again. Just restored etcd on my first node from a snapshot. kubectl get nodes still shows it as NotReady, and etcd shows:
Copy code
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| <https://10.0.1.62:2379> | 5300d46da5ef6c19 |  3.5.13 |  1.2 GB |      true |      false |        23 |  744071023 |          744071023 |        |
| <https://10.0.1.61:2379> | bcf7f605752d301a |  3.5.13 |  1.3 GB |      true |      false |         3 |      16587 |              16587 |        |
| <https://10.0.1.63:2379> | 6a64f80a55285718 |  3.5.13 |  1.0 GB |     false |      false |        23 |  744071030 |          744071030 |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Is this normal/to be expected?
I assume it's reconciling and catching up with everything which happened in 20 hours.
Oooooh never mind, that does not seem right. Running kubectl get nodes on the recovered node (H1):
Copy code
NAME          STATUS     ROLES                       AGE    VERSION
harvester-1   Ready      control-plane,etcd,master   268d   v1.29.9+rke2r1
harvester-2   NotReady   control-plane,etcd,master   188d   v1.29.9+rke2r1
harvester-3   NotReady   control-plane,etcd,master   92d    v1.29.9+rke2r1
And if I run it on H2, it shows 2 and 3 as Ready but not 1.
Copy code
INFO[0081] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes
When running the restore, it said this at the end. I was expecting that it would rejoin the cluster with H2 and H3 and then update the etcd DB on H1?
At the end of the day (and what a day!) - Harvester synced all my volumes and recreated replicas of my LH volumes. All VMs got migrated to H2/H3. The H1 node is empty. I tried deleting etcd on H1 completely so that it might try synchronising with the etcds on H2/H3 rather than create a new cluster from the etcd snapshot. And, worst case, I'll just reinstall Harvester on H1 - treat the cluster as cattle, not pets 😄 It is pretty cool that Harvester got back up despite the catastrophic events on H1 and H2.
🎉 1