# longhorn-storage
j
The disk clone completed successfully, and I was able to swap in the clone for the original disk, which I then removed; the VM is working fine. I would have loved to understand this better and solve it properly instead of using this dirty workaround, so if anyone has pointers for when it happens again, I'm all ears. Thanks!
b
sometimes pods don't get cleared out the way they should.
By that, I mean you'll have something like
virt-launcher-vm-1c
and when you migrate, there's
virt-launcher-vm-2x
running on the new node.
Instead of the 1c pod being deleted, it goes stale for whatever reason and sticks around, and Longhorn gets hung up because that pod is still "using" the PVC/volume.
We've also had times where you could force-delete that stuck volume attachment by getting rid of the finalizer, but it's generally not the safest plan.
We've also had nodes get stuck in bad states where the kernel is wonky, and rebooting fixed that too.
There are a lot of root causes that can produce this symptom, in my experience.
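For reference, a minimal sketch of the checks and the last-resort finalizer removal described above, assuming the usual KubeVirt virt-launcher pod label (worth verifying in your own cluster); the attachment name is a placeholder:

```bash
# Look for stale virt-launcher pods that may still be holding the volume
# (kubevirt.io=virt-launcher is the usual KubeVirt pod label).
kubectl get pods -A -l kubevirt.io=virt-launcher -o wide

# List VolumeAttachments and the PV/node each one claims to be using.
kubectl get volumeattachments \
  -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached

# Last resort: strip the finalizer so the stuck attachment can actually be
# deleted. <attachment-name> is a placeholder; only do this once you are sure
# nothing is really using the volume.
kubectl patch volumeattachment <attachment-name> --type=merge -p '{"metadata":{"finalizers":null}}'
```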
j
I see, and I’ve seen almost every case you described, but this was different. I made sure there was no virt-launcher pod running on any node, the VMI status was either Succeeded or nonexistent (when I removed it manually), and the VM was Stopped. I could force the removal of the attachment from the UI or by deleting the volumeattachment resource, but it was just recreated by something. I believe it’s a bug in a Longhorn controller or maybe in Harvester, but I have found no log mentioning where the attachment instruction originated, only the event itself. I don’t know (yet) how to trace object creation in the control plane to spot where it’s coming from; I plan to do that next time.
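One low-effort way to chase down what keeps recreating the attachment, as a sketch: the API server records which client wrote each field in an object's managedFields, so inspecting the recreated VolumeAttachment with managed fields shown can point at the responsible controller (the attachment name below is a placeholder):

```bash
# Find the VolumeAttachment that keeps coming back.
kubectl get volumeattachments

# The "manager" entries in managedFields name the client/controller that
# created or last updated each field.
kubectl get volumeattachment <attachment-name> -o yaml --show-managed-fields
```

If that isn't conclusive, kube-apiserver audit logging would also record which client issued the create, but enabling it is a bigger change to make on a production cluster.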
b
You can defrag etcd too. sometimes that helps
j
I’ve played with various etcd flavours successfully, but RKE2’s and k3s’s are hard to manipulate, and I don’t want to experiment in my production cluster.
b
the k3s one I haven't figured out yet, because it doesn't break etcd out into a separate process.
but the RKE2 one is straightforward
I have a script... hang on.
```bash
#!/bin/bash
# Defrag etcd on an RKE2 cluster by exec'ing etcdctl inside one of the
# kube-system etcd pods; cert paths are the RKE2 defaults.

# Pick the first etcd pod and run everything through it.
etcdnode=$(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name | head -1)

echo "Getting etcd Status"

kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint status --cluster -w table

echo "Defragging the etcd in the current cluster via ${etcdnode}"

kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt defrag --cluster

echo "Getting etcd Health"

kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint health --cluster -w table

echo "Getting etcd Status"

kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint status --cluster -w table
```
It works on Harvester clusters.
j
Thanks! If you ever need to troubleshoot with k3s’s etcd db I have a few notes from my last crash & rebuild.
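For the k3s side, a rough sketch of talking to the embedded etcd from a server node; k3s doesn't bundle etcdctl, and the cert paths below are assumed equivalents of the RKE2 ones in the script above, so verify what actually exists under /var/lib/rancher/k3s before relying on this:

```bash
# Run on a k3s server node that is an etcd member; etcdctl must be installed
# separately, since k3s does not ship it.
# NOTE: these cert paths are assumptions modelled on the RKE2 layout; confirm
# them under /var/lib/rancher/k3s/server/tls/etcd/ first.
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert   /var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key    /var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status --cluster -w table
```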
b
😉
hopefully it'll never come to that.
lol
but the output looks like this:
```
bmonroe@orden:~/Projects/rancher-debugging$ kubectl get nodes
NAME   STATUS   ROLES                       AGE    VERSION
dh1    Ready    control-plane,etcd,master   125d   v1.30.7+rke2r1
dh2    Ready    <none>                      89d    v1.30.7+rke2r1
dh3    Ready    control-plane,etcd,master   124d   v1.30.7+rke2r1
dh4    Ready    control-plane,etcd,master   124d   v1.30.7+rke2r1
bmonroe@orden:~/Projects/rancher-debugging$ ./etcd-defrag.sh 
Getting etcd Status
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://128.111.126.73:2379 | 43c449ab5873ac62 |  3.5.16 |  104 MB |      true |      false |         8 |  217156303 |          217156303 |        |
| https://128.111.126.71:2379 | 5b59076779648677 |  3.5.16 |  104 MB |     false |      false |         8 |  217156303 |          217156303 |        |
| https://128.111.126.74:2379 | 8ba8fa961fae49d4 |  3.5.16 |  104 MB |     false |      false |         8 |  217156303 |          217156303 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Defragging the etcd in the current cluster via etcd-dh1
Finished defragmenting etcd member[https://128.111.126.73:2379]
Finished defragmenting etcd member[https://128.111.126.71:2379]
Finished defragmenting etcd member[https://128.111.126.74:2379]
Getting etcd Health
+-----------------------------+--------+------------+-------+
|          ENDPOINT           | HEALTH |    TOOK    | ERROR |
+-----------------------------+--------+------------+-------+
| https://128.111.126.71:2379 |   true | 4.053719ms |       |
| https://128.111.126.73:2379 |   true | 6.320412ms |       |
| https://128.111.126.74:2379 |   true | 6.578674ms |       |
+-----------------------------+--------+------------+-------+
Getting etcd Status
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://128.111.126.73:2379 | 43c449ab5873ac62 |  3.5.16 |   55 MB |      true |      false |         8 |  217156398 |          217156398 |        |
| https://128.111.126.71:2379 | 5b59076779648677 |  3.5.16 |   55 MB |     false |      false |         8 |  217156398 |          217156398 |        |
| https://128.111.126.74:2379 | 8ba8fa961fae49d4 |  3.5.16 |   55 MB |     false |      false |         8 |  217156398 |          217156398 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```
j
Thanks, I'll try that with a fresh mind tomorrow.