# longhorn-storage
j
The disk clone completed successfully, and I was able to swap in the clone for the original disk, which I then removed; the VM is working fine. I would have loved to understand this better and solve it properly instead of using this dirty workaround, so if anyone has pointers for when it happens again, I'm all ears. Thanks!
b
sometimes pods don't get cleared out the way they should.
By that, I mean you'll have something like
virt-launcher-vm-1c
and when you migrate, there's
virt-launcher-vm-2x
running on the new node.
Instead of the 1c pod being deleted, it goes stale for whatever reason and sticks around, and Longhorn gets hung up because that pod is still "using" the PVC/volume.
We've also had times where you could force-delete that stuck volume attachment by getting rid of the finalizer, but it's generally not the safest plan.
We've also had nodes get stuck in bad states where the kernel is wonky, and rebooting fixed that too.
There are a lot of root causes that can produce this symptom, in my experience.
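For reference, a minimal sketch of the checks and the last-resort finalizer removal described above, assuming the usual KubeVirt virt-launcher pod label (worth verifying in your own cluster); the attachment name is a placeholder:

```bash
# Look for stale virt-launcher pods that may still be holding the volume
# (kubevirt.io=virt-launcher is the usual KubeVirt pod label).
kubectl get pods -A -l kubevirt.io=virt-launcher -o wide

# List VolumeAttachments and the PV/node each one claims to be using.
kubectl get volumeattachments \
  -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached

# Last resort: strip the finalizer so the stuck attachment can actually be
# deleted. <attachment-name> is a placeholder; only do this once you are sure
# nothing is really using the volume.
kubectl patch volumeattachment <attachment-name> --type=merge -p '{"metadata":{"finalizers":null}}'
```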
j
I see, and I’ve seen almost every case you described, but this was different. I made sure there was no virt-launcher pod running on any node, the VMI status was either Succeeded or nonexistent (when I removed it manually), and the VM was Stopped. I could force the removal of the attachment from the UI or by deleting the volumeattachment resource, but it was just recreated by something. I believe it’s a bug in a Longhorn controller or maybe in Harvester, but I have found no log mentioning where the attachment instruction originated, only the event itself. I don’t know (yet) how to trace object creation in the control plane to spot where it’s coming from; I plan to do that next time.
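One low-effort way to chase down what keeps recreating the attachment, as a sketch: the API server records which client wrote each field in an object's managedFields, so inspecting the recreated VolumeAttachment with managed fields shown can point at the responsible controller (the attachment name below is a placeholder):

```bash
# Find the VolumeAttachment that keeps coming back.
kubectl get volumeattachments

# The "manager" entries in managedFields name the client/controller that
# created or last updated each field.
kubectl get volumeattachment <attachment-name> -o yaml --show-managed-fields
```

If that isn't conclusive, kube-apiserver audit logging would also record which client issued the create, but enabling it is a bigger change to make on a production cluster.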
b
You can defrag etcd too. sometimes that helps
j
I’ve played with various etcd flavours successfully, but RKE2’s and k3s’s are hard to manipulate, and I don’t want to experiment in my production cluster.
b
the k3s one I haven't figured out yet, because it doesn't break etcd out into a separate process.
but the RKE2 one is straightforward
I have a script... hang on.
```bash
#!/bin/bash
# Defrag etcd on an RKE2 cluster by exec'ing etcdctl inside one of the
# kube-system etcd pods; cert paths are the RKE2 defaults.

# Pick the first etcd pod and run everything through it.
etcdnode=$(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name | head -1)

echo "Getting etcd Status"

kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint status --cluster -w table

echo "Defragging the etcd in the current cluster via ${etcdnode}"

kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt defrag --cluster

echo "Getting etcd Health"

kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint health --cluster -w table

echo "Getting etcd Status"

kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint status --cluster -w table
```
It works on Harvester clusters.
j
Thanks! If you ever need to troubleshoot with k3s’s etcd db I have a few notes from my last crash & rebuild.
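For the k3s side, a rough sketch of talking to the embedded etcd from a server node; k3s doesn't bundle etcdctl, and the cert paths below are assumed equivalents of the RKE2 ones in the script above, so verify what actually exists under /var/lib/rancher/k3s before relying on this:

```bash
# Run on a k3s server node that is an etcd member; etcdctl must be installed
# separately, since k3s does not ship it.
# NOTE: these cert paths are assumptions modelled on the RKE2 layout; confirm
# them under /var/lib/rancher/k3s/server/tls/etcd/ first.
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert   /var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key    /var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status --cluster -w table
```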
b
😉
hopefully it'll never come to that.
lol
but the output looks like this:
```
bmonroe@orden:~/Projects/rancher-debugging$ kubectl get nodes
NAME   STATUS   ROLES                       AGE    VERSION
dh1    Ready    control-plane,etcd,master   125d   v1.30.7+rke2r1
dh2    Ready    <none>                      89d    v1.30.7+rke2r1
dh3    Ready    control-plane,etcd,master   124d   v1.30.7+rke2r1
dh4    Ready    control-plane,etcd,master   124d   v1.30.7+rke2r1
bmonroe@orden:~/Projects/rancher-debugging$ ./etcd-defrag.sh 
Getting etcd Status
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://128.111.126.73:2379 | 43c449ab5873ac62 |  3.5.16 |  104 MB |      true |      false |         8 |  217156303 |          217156303 |        |
| https://128.111.126.71:2379 | 5b59076779648677 |  3.5.16 |  104 MB |     false |      false |         8 |  217156303 |          217156303 |        |
| https://128.111.126.74:2379 | 8ba8fa961fae49d4 |  3.5.16 |  104 MB |     false |      false |         8 |  217156303 |          217156303 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Defragging the etcd in the current cluster via etcd-dh1
Finished defragmenting etcd member[https://128.111.126.73:2379]
Finished defragmenting etcd member[https://128.111.126.71:2379]
Finished defragmenting etcd member[https://128.111.126.74:2379]
Getting etcd Health
+-----------------------------+--------+------------+-------+
|          ENDPOINT           | HEALTH |    TOOK    | ERROR |
+-----------------------------+--------+------------+-------+
| https://128.111.126.71:2379 |   true | 4.053719ms |       |
| https://128.111.126.73:2379 |   true | 6.320412ms |       |
| https://128.111.126.74:2379 |   true | 6.578674ms |       |
+-----------------------------+--------+------------+-------+
Getting etcd Status
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://128.111.126.73:2379 | 43c449ab5873ac62 |  3.5.16 |   55 MB |      true |      false |         8 |  217156398 |          217156398 |        |
| https://128.111.126.71:2379 | 5b59076779648677 |  3.5.16 |   55 MB |     false |      false |         8 |  217156398 |          217156398 |        |
| https://128.111.126.74:2379 | 8ba8fa961fae49d4 |  3.5.16 |   55 MB |     false |      false |         8 |  217156398 |          217156398 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```
j
Thanks, I'll try that with a fresh mind tomorrow.