powerful-easter-15334
08/14/2025, 5:58 AMpowerful-easter-15334
08/14/2025, 6:02 AMpowerful-easter-15334
08/14/2025, 6:08 AMpowerful-easter-15334
08/14/2025, 6:09 AM"Failed to test data store connection: failed to defragment etcd database: context deadline exceeded"
"Defragmenting etcd database"
powerful-easter-15334
08/14/2025, 6:16 AMpowerful-easter-15334
08/14/2025, 6:37 AMpowerful-easter-15334
08/14/2025, 6:40 AMpowerful-easter-15334
08/14/2025, 6:41 AMpowerful-easter-15334
08/14/2025, 6:45 AMpowerful-easter-15334
08/14/2025, 6:46 AM
Events:
Normal NodeUnresponsive 57m (x2 over 60m) node-controller virt-handler is not responsive, marking node as unresponsive
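Since the event blames virt-handler, one quick way to locate its pod on the affected node is something like the following (the namespace and label assume the usual Harvester/KubeVirt layout, so adjust if yours differs):
kubectl -n harvester-system get pods -l kubevirt.io=virt-handler -o wide | grep harvester-2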
powerful-easter-15334
08/14/2025, 7:02 AM"component":"virt-handler"
"level":"error"
"msg":"Unable to mark vmi as unresponsive socket //pods/*/volumes/kubernetes.io~empty-dir/sockets/launcher-sock"
"pos":"cache.go:486"
"reason":"open /pods/*/volumes/kubernetes.io~empty-dir/sockets/launcher-unresponsive: no such file or directory"
So I restarted that pod, and now it's back to working normally. harvester-2 is back to schedulable.
powerful-easter-15334
08/14/2025, 7:17 AMpowerful-easter-15334
08/14/2025, 7:17 AMpowerful-easter-15334
08/14/2025, 7:18 AMpowerful-easter-15334
08/14/2025, 7:21 AM
kubectl get engines -n longhorn-system
I can use the above to get the engines and then just use grep to filter for the PVC I'm experimenting on.
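A minimal sketch of that filter, with the volume (pvc-...) name as a placeholder:
kubectl get engines.longhorn.io -n longhorn-system | grep <pvc-name>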
In this case, there are indeed 2 engines, one on H2 and one on H3. Was that maybe caused by the attempted migration of VMs off of H2?
powerful-easter-15334
08/14/2025, 7:24 AMbrainy-kilobyte-33711
08/14/2025, 7:24 AM
kubectl get -A VirtualMachineInstanceMigration
brainy-kilobyte-33711
08/14/2025, 7:25 AMpowerful-easter-15334
08/14/2025, 7:28 AMpowerful-easter-15334
08/14/2025, 7:29 AMbrainy-kilobyte-33711
08/14/2025, 7:35 AMbrainy-kilobyte-33711
08/14/2025, 7:37 AMpowerful-easter-15334
08/14/2025, 7:38 AMpowerful-easter-15334
08/14/2025, 7:38 AMbrainy-kilobyte-33711
08/14/2025, 7:39 AMpowerful-easter-15334
08/14/2025, 7:41 AM
k get vmi -n sicorax
NAME      AGE     PHASE        IP   NODENAME   READY
sicorax   5h39m   Scheduling                   False
It says scheduling but it's still running on H1.
It looks like it's trying to go to H3?
I got this from the pod, which is in Init:0/3 state:
Warning FailedAttachVolume 4m27s (x124 over 5h31m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-5bd3aeca-91da-4426-8a31-d10268cb195e" : rpc error: code = DeadlineExceeded desc = volume pvc-5bd3aeca-91da-4426-8a31-d10268cb195e failed to attach to node harvester-3 with attachmentID csi-615a1e5d177c160a054719e3e4f62aa580373ec48245ce9cb89c403519d2ffae
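One way to dig further is to inspect the VolumeAttachment named in that event and the corresponding Longhorn volume (the object names are copied from the event above; the resource kinds are the standard ones, so adjust if your setup differs):
kubectl get volumeattachment csi-615a1e5d177c160a054719e3e4f62aa580373ec48245ce9cb89c403519d2ffae -o yaml
kubectl -n longhorn-system get volumes.longhorn.io pvc-5bd3aeca-91da-4426-8a31-d10268cb195e -o yaml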
powerful-easter-15334
08/14/2025, 7:42 AMpowerful-easter-15334
08/14/2025, 7:43 AMbrainy-kilobyte-33711
08/14/2025, 7:44 AMbrainy-kilobyte-33711
08/14/2025, 7:44 AMbrainy-kilobyte-33711
08/14/2025, 7:45 AMbrainy-kilobyte-33711
08/14/2025, 7:46 AMpowerful-easter-15334
08/14/2025, 7:46 AMpowerful-easter-15334
08/14/2025, 7:47 AMbrainy-kilobyte-33711
08/14/2025, 7:47 AMbrainy-kilobyte-33711
08/14/2025, 7:47 AMbrainy-kilobyte-33711
08/14/2025, 7:47 AMpowerful-easter-15334
08/14/2025, 7:49 AMpowerful-easter-15334
08/14/2025, 7:51 AMbrainy-kilobyte-33711
08/14/2025, 7:52 AM
etcdctl defrag command
powerful-easter-15334
08/14/2025, 7:58 AM
bash: etcdctl: command not found
brainy-kilobyte-33711
08/14/2025, 8:01 AM
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status --cluster --write-out=table
powerful-easter-15334
08/14/2025, 8:03 AM
Error while dialing: dial unix /var/run/k3s/containerd/containerd.sock: connect: connection refused
And I can't exec into the pod from H2/3
Error from server: error dialing backend: proxy error from 127.0.0.1:9345 while dialing 10.0.1.61:10250, code 502: 502 Bad Gateway
powerful-easter-15334
08/14/2025, 8:08 AMpowerful-easter-15334
08/14/2025, 8:09 AMbrainy-kilobyte-33711
08/14/2025, 8:19 AMpowerful-easter-15334
08/14/2025, 8:20 AM
Failed to get the status of endpoint https://10.0.1.61:2379 (context deadline exceeded)
brainy-kilobyte-33711
08/14/2025, 8:21 AM
ps aux | grep etcd
can you see etcd running?
brainy-kilobyte-33711
08/14/2025, 8:21 AM
harvester-node-0-250204:~ # ps aux | grep etcd
root 6833 29.2 0.0 12287836 503936 ? Ssl Feb04 80372:05 etcd --config-file=/var/lib/rancher/rke2/server/db/etcd/config
powerful-easter-15334
08/14/2025, 8:22 AM
root 4051 16.9 0.6 12421172 1733948 ? Ssl Feb06 46030:31 etcd --config-file=/var/lib/rancher/rke2/server/db/etcd/config
powerful-easter-15334
08/14/2025, 8:23 AMbrainy-kilobyte-33711
08/14/2025, 8:23 AMbrainy-kilobyte-33711
08/14/2025, 8:23 AM
nsenter --target ETCD_PID --mount --net --pid --uts -- etcdctl \
--cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
--cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status --cluster --write-out=table;
brainy-kilobyte-33711
08/14/2025, 8:23 AMpowerful-easter-15334
08/14/2025, 8:24 AM
harvester-1:/home/rancher # nsenter --target 4051 --mount --net --pid --uts -- etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status --cluster --write-out=table;
{"level":"warn","ts":"2025-08-14T08:23:51.452266Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004c21e0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded
powerful-easter-15334
08/14/2025, 8:24 AMbrainy-kilobyte-33711
08/14/2025, 8:25 AMbrainy-kilobyte-33711
08/14/2025, 8:27 AM
Replace endpoint status --cluster --write-out=table; with defrag; and see what happens.
powerful-easter-15334
08/14/2025, 8:28 AM
Failed to defragment etcd member[127.0.0.1:2379] (context deadline exceeded)
So I should be able to just put a massive timeout and keep my fingers crossed.
powerful-easter-15334
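A sketch of that, reusing the nsenter wrapper from above with etcdctl's --command-timeout flag (the PID and the 10m value are just examples):
nsenter --target 4051 --mount --net --pid --uts -- etcdctl \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --command-timeout=10m defrag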
08/14/2025, 8:29 AMbrainy-kilobyte-33711
08/14/2025, 8:29 AM
systemctl stop rke2-server
whilst you do this
powerful-easter-15334
08/14/2025, 8:30 AMpowerful-easter-15334
08/14/2025, 10:48 AMpowerful-easter-15334
08/14/2025, 10:51 AMbrainy-kilobyte-33711
08/14/2025, 11:34 AMbrainy-kilobyte-33711
08/14/2025, 11:35 AMpowerful-easter-15334
08/14/2025, 11:36 AMpowerful-easter-15334
08/14/2025, 11:36 AMbrainy-kilobyte-33711
08/14/2025, 11:37 AM
kubectl edit Settings overcommit-config
to edit via CLI if UI is down
powerful-easter-15334
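For reference, that setting holds a JSON blob of percentages. A quick way to read it (the numbers below are just the stock defaults, not a recommendation, and value may be empty if it has never been overridden):
kubectl get settings.harvesterhci.io overcommit-config -o jsonpath='{.value}'
# roughly: {"cpu":1600,"memory":150,"storage":200}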
08/14/2025, 11:37 AMpowerful-easter-15334
08/14/2025, 4:45 PM
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.1.62:2379 | 5300d46da5ef6c19 | 3.5.13  | 1.2 GB  | true      | false      | 23        | 744071023  | 744071023          |        |
| https://10.0.1.61:2379 | bcf7f605752d301a | 3.5.13  | 1.3 GB  | true      | false      | 3         | 16587      | 16587              |        |
| https://10.0.1.63:2379 | 6a64f80a55285718 | 3.5.13  | 1.0 GB  | false     | false      | 23        | 744071030  | 744071030          |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Is this normal/to be expected?
powerful-easter-15334
08/14/2025, 4:47 PMpowerful-easter-15334
08/14/2025, 4:48 PM
NAME          STATUS     ROLES                       AGE    VERSION
harvester-1   Ready      control-plane,etcd,master   268d   v1.29.9+rke2r1
harvester-2   NotReady   control-plane,etcd,master   188d   v1.29.9+rke2r1
harvester-3   NotReady   control-plane,etcd,master   92d    v1.29.9+rke2r1
And if I run it on H2, it shows 2 and 3 ready but not 1.
powerful-easter-15334
08/14/2025, 4:55 PM
INFO[0081] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes
When running the migration, it says this at the end. I was expecting that it would rejoin the cluster with H2 and H3 and then update the etcd db on H1?
powerful-easter-15334
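For the record, the rejoin that message asks for is usually done on each of the other etcd nodes along these lines (paths assume the default RKE2 data dir; this is a sketch, not a verified runbook):
systemctl stop rke2-server
mv /var/lib/rancher/rke2/server/db /var/lib/rancher/rke2/server/db.bak   # back up and remove, per the message above
systemctl start rke2-server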
08/14/2025, 6:27 PM