# harvester
c
is this an instance running within a VM on Harvester?
g
How to recover:
1. Go to another node to check the etcd cluster status; for example, go to harvester-node-1
$ for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do kubectl -n kube-system exec $etcdpod -- sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl endpoint health"; done
// output
https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 12.182346ms
https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 3.337284ms
Error from server: error dialing backend: proxy error from 127.0.0.1:9345 while dialing 192.168.0.32:10250, code 503: 503 Service Unavailable

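The $etcdcontainer variable used in the next commands isn't defined in this snippet; presumably it holds the ID of the etcd container on the node. A minimal sketch of how it could be resolved with crictl, assuming RKE2's etcd container carries the standard io.kubernetes.container.name label:

// assumption: the etcd container is labelled io.kubernetes.container.name=etcd
$ etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)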
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list -w table"
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |           NAME            |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
| 2fb80ba72cbc8dbf | started | harvester-node-1-90e941aa | https://192.168.0.31:2380 | https://192.168.0.31:2379 |      false |
| a1b3c1454f4ab0e1 | started | harvester-node-2-a20577c5 | https://192.168.0.32:2380 | https://192.168.0.32:2379 |      false |
| adc70370cdccfae7 | started | harvester-node-0-db70dc2f | https://192.168.0.30:2380 | https://192.168.0.30:2379 |      false |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+

Record the broken node's Name, ID, and PEER ADDRS (e.g. ID: a1b3c1454f4ab0e1, Name: harvester-node-2-a20577c5, PEER ADDRS: https://192.168.0.32:2380)
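As an optional helper (not part of the guide itself), the broken member's hex ID can be pulled out of the default comma-separated member list output instead of being copied by hand; BROKEN_NAME is a placeholder for the name recorded above:

$ BROKEN_NAME=harvester-node-2-a20577c5
$ BROKEN_ID=$(/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list" | awk -F', ' -v n="$BROKEN_NAME" '$3==n {print $1}')
$ echo $BROKEN_ID
a1b3c1454f4ab0e1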
2. Remove the database cleanly on harvester-node-2
$ systemctl stop rke2-server.service  <-- make sure etcd is stopped
$ cd /var/lib/rancher/rke2/server/db/etcd  <-- go to the etcd folder
$ mv member/ /root/member-bak  <-- back up the db; the member directory contains the wal and snap files
$ ls
config  name  <-- config can be reused, keep it
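As a quick sanity check before touching the data directory, the etcd container should be gone once rke2-server is stopped (same crictl label assumption as above):

$ /var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet
// empty output means no etcd container is running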
3. Go back to harvester-node-1, remove harvester-node-2 from the cluster, and check
// remove harvester-node-2-a20577c5 from the etcd cluster
// remember: etcdctl member remove needs the ID (e.g. a1b3c1454f4ab0e1)
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member remove a1b3c1454f4ab0e1"

Member a1b3c1454f4ab0e1 removed from cluster 6f68b64b4b6d89ef

// check the current cluster, harvester-node-2 should be removed
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list -w table"

+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |           NAME            |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
| 2fb80ba72cbc8dbf | started | harvester-node-1-90e941aa | https://192.168.0.31:2380 | https://192.168.0.31:2379 |      false |
| adc70370cdccfae7 | started | harvester-node-0-db70dc2f | https://192.168.0.30:2380 | https://192.168.0.30:2379 |      false |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
4. Add harvester-node-2 back, and check that its status is unstarted (run this on the node where the remove was executed)
// add harvester-node-2 back; you need the node name and the peer addr
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member add harvester-node-2-a20577c5 --peer-urls=https://192.168.0.32:2380"
Member 2aa8330fb817c14d added to cluster 6f68b64b4b6d89ef

ETCD_NAME="harvester-node-2-a20577c5"
ETCD_INITIAL_CLUSTER="harvester-node-2-a20577c5=https://192.168.0.32:2380,harvester-node-1-90e941aa=https://192.168.0.31:2380,harvester-node-0-db70dc2f=https://192.168.0.30:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.32:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

// check the cluster status, harvester-node-2 should be added back but not started
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list -w table"
+------------------+-----------+---------------------------+---------------------------+---------------------------+------------+
|        ID        |  STATUS   |           NAME            |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+-----------+---------------------------+---------------------------+---------------------------+------------+
| 2aa8330fb817c14d | unstarted |                           | https://192.168.0.32:2380 |                           |      false |
| 2fb80ba72cbc8dbf |   started | harvester-node-1-90e941aa | https://192.168.0.31:2380 | https://192.168.0.31:2379 |      false |
| adc70370cdccfae7 |   started | harvester-node-0-db70dc2f | https://192.168.0.30:2380 | https://192.168.0.30:2379 |      false |
+------------------+-----------+---------------------------+---------------------------+---------------------------+------------+
5. Go to harvester-node-2 and start rke2-server
$ systemctl start rke2-server
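While it starts, progress can be watched from another terminal with standard systemd tooling:

$ journalctl -u rke2-server -f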
6. It should start without error; then we can check the etcd cluster again
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list -w table"
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |           NAME            |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
| 2aa8330fb817c14d | started | harvester-node-2-a20577c5 | https://192.168.0.32:2380 | https://192.168.0.32:2379 |      false |
| 2fb80ba72cbc8dbf | started | harvester-node-1-90e941aa | https://192.168.0.31:2380 | https://192.168.0.31:2379 |      false |
| adc70370cdccfae7 | started | harvester-node-0-db70dc2f | https://192.168.0.30:2380 | https://192.168.0.30:2379 |      false |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
j
It's real hardware, namely three large HP racked servers, each with 32 cores, 96 GB memory, a 500 GB SAS array for Harvester, and a 7 TB SSD array for Longhorn. All storage is battery-backed, but the filesystem was corrupted anyway. I guess the cache wasn't enough for the WAL journal to be written out after shutdown.
Thank you! I could have never guessed. I'll try this today.
There is no 'sh' or 'bash' in $PATH on those pods, apparently. But maybe the env is already set up as needed? I'll try running etcdctl directly.
for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do kubectl -n kube-system exec $etcdpod -- etcdctl endpoint health; done
{"level":"warn","ts":"2024-11-27T05:54:02.146243Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000526000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"error reading server preface: read tcp 127.0.0.1:53020->127.0.0.1:2379: read: connection reset by peer\""}
127.0.0.1:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
command terminated with exit code 1
Error from server: error dialing backend: proxy error from 127.0.0.1:9345 while dialing 192.168.1.190:10250, code 502: 502 Bad Gateway
{"level":"warn","ts":"2024-11-27T05:54:07.99899Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003ee000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"error reading server preface: EOF\""}
127.0.0.1:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
command terminated with exit code 1
I'm pretty sure two etcd nodes are running fine; the primary just complains about not being able to reach the third node... But the first command fails anyway.
The peculiar thing is that the failing node was the first node, where the cluster was created. I noticed it's still specified as the go-to node for initialisation in every config file. Maybe that's why etcdctl goes there by default?
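For reference, the guide's health-check loop can be rewritten with etcdctl flags instead of environment variables, which sidesteps the missing shell in the pods (the certificate paths are assumed to be the same as in the guide):

for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do
  kubectl -n kube-system exec $etcdpod -- etcdctl \
    --endpoints https://127.0.0.1:2379 \
    --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
    --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
    --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
    endpoint health
done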
g
you could go to the second node to find the members
I am assuming your cluster had 2 other nodes promoted
else you can delete this node, wipe it, and add it back as a new node
j
I was on the second node; the one with a broken etcd can't run kubectl. The third node has the same output, but etcd must be running OK because I have a massive workload and lots of VMs running without any problem on the two other nodes. I think etcdctl in each pod is just trying to reach the broken node instead of its own etcd. I'll investigate the pod env and see if it matches the one suggested above.
Ok, I found how to provide the parameters to etcdctl. Here is an adaptation of your first two commands from node “quo”, which is a good one. The broken node is “qui” at 192.168.1.190. Now I’ll adapt your instructions and try the full procedure.
quo:~ # kubectl -n kube-system exec etcd-quo -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key member list -w table
+------------------+---------+--------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |     NAME     |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------------+----------------------------+----------------------------+------------+
| 1b8b6a40e1cfc42b | started | quo-ff1cd5e5 | https://192.168.1.192:2380 | https://192.168.1.192:2379 |      false |
| 8d7ce6d289374fa9 | started | qua-0cae4173 | https://192.168.1.193:2380 | https://192.168.1.193:2379 |      false |
| caaab7311c583503 | started | qui-316cad6d | https://192.168.1.190:2380 | https://192.168.1.190:2379 |      false |
+------------------+---------+--------------+----------------------------+----------------------------+------------+
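Following the guide, the removal step adapted the same way would presumably be (qui's ID taken from the table above):

quo:~ # kubectl -n kube-system exec etcd-quo -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key member remove caaab7311c583503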
I’m stuck starting rke2-server. Even when I change the “config” file in /var/lib/rancher/rke2/server/db/etcd to use the right initial-cluster and initial-cluster-state, rke2-server resets it to the original ones from when the cluster needed to be created from scratch. If I’m right, I have to find out where rke2-server finds that information and override it. Or, I can try to use chattr +i and make the config file immutable…
When the config file is immutable, rke2-server doesn't want to start. 🙄 Now, where does rke2 take its configuration from?
Nov 27 08:29:21 qui rke2[23728]: time="2024-11-27T08:29:21Z" level=fatal msg="starting kubernetes: preparing server: start managed database: open /var/lib/rancher/rke2/server/db/etcd/config: operation not permitted"
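One way to hunt for the source (a guess at the command, given the grep-style output below):

qui:~ # grep -rn "initial" /var/lib/rancher/rke2/agent/pod-manifests/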
I found this, could it be the source? I’ll try to change that to the necessary configuration.
/var/lib/rancher/rke2/agent/pod-manifests/etcd.yaml:5:    etcd.k3s.io/initial: '{"initial-advertise-peer-urls":"https://192.168.1.190:2380","initial-cluster":"qui-316cad6d=https://192.168.1.190:2380","initial-cluster-state":"new"}'
It needs to be like this:
Copy code
initial-cluster: quo-ff1cd5e5=https://192.168.1.192:2380,qui-316cad6d=https://192.168.1.190:2380,qua-0cae4173=https://192.168.1.193:2380
initial-cluster-state: existing
On the other nodes I see it's actually just
etcd.k3s.io/initial: '{}'
I’ll go with that and hope the config file doesn’t get rewritten.
That didn’t work, but I got it anyway… by injecting a correct copy of the etcd config file every 100ms… not the most elegant solution, but it worked. Thanks for your help!
while (true); do cat config.bkp > config; sleep 0.1; done &
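A slightly less busy variant of the same trick, assuming inotify-tools is available on the node (a hypothetical alternative, not what was actually run):

# rewrite the config only when rke2 touches it, instead of every 100ms
while inotifywait -qq -e modify,close_write /var/lib/rancher/rke2/server/db/etcd/config; do
  cat config.bkp > /var/lib/rancher/rke2/server/db/etcd/config
done &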