great-bear-19718
11/26/2024, 11:07 PM
How to recover:
1. Go to another node to check the etcd cluster status, for example harvester-node-1
$ for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do kubectl -n kube-system exec $etcdpod -- sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl endpoint health"; done
//output
https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 12.182346ms
https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 3.337284ms
Error from server: error dialing backend: proxy error from 127.0.0.1:9345 while dialing 192.168.0.32:10250, code 503: 503 Service Unavailable
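(Note: the crictl commands below assume $etcdcontainer holds the ID of the local etcd container; a minimal way to set it, assuming the container carries the usual io.kubernetes.container.name=etcd label, is:)
$ etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)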
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list -w table"
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
| ID               | STATUS  | NAME                      | PEER ADDRS                | CLIENT ADDRS              | IS LEARNER |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
| 2fb80ba72cbc8dbf | started | harvester-node-1-90e941aa | https://192.168.0.31:2380 | https://192.168.0.31:2379 | false      |
| a1b3c1454f4ab0e1 | started | harvester-node-2-a20577c5 | https://192.168.0.32:2380 | https://192.168.0.32:2379 | false      |
| adc70370cdccfae7 | started | harvester-node-0-db70dc2f | https://192.168.0.30:2380 | https://192.168.0.30:2379 | false      |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
Please record the broken node's Name, ID, and PEER ADDRS (e.g. ID: a1b3c1454f4ab0e1, Name: harvester-node-2-a20577c5, PEER ADDRS: https://192.168.0.32:2380)
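(If the member list is long, filtering the table output for the broken node is a quick convenience on top of the member list command above:)
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list -w table" | grep harvester-node-2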
2. Remove the db cleanly on harvester-node-2
$ systemctl stop rke2-server.service <-- make sure etcd is stopped (see the check below)
$ cd /var/lib/rancher/rke2/server/db/etcd <-- go to the etcd folder
$ mv member/ /root/member-bak <-- back up the db; the member directory contains the wal and snap files
$ ls
config name <-- config can be reused, keep it
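(A quick way to confirm etcd is really down before touching the data, assuming the same crictl setup as above: list containers filtered by the etcd label; no output means the etcd container is gone.)
$ /var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd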
3. Go back to harvester-node-1, remove harvester-node-2 from the cluster, and check
// remove harvester-node-2-a20577c5 from the etcd cluster
// note that etcdctl member remove needs the ID (e.g. a1b3c1454f4ab0e1)
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member remove a1b3c1454f4ab0e1"
Member a1b3c1454f4ab0e1 removed from cluster 6f68b64b4b6d89ef
// check the current cluster, harvester-node-2 should be removed
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list -w table"
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
| ID               | STATUS  | NAME                      | PEER ADDRS                | CLIENT ADDRS              | IS LEARNER |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
| 2fb80ba72cbc8dbf | started | harvester-node-1-90e941aa | https://192.168.0.31:2380 | https://192.168.0.31:2379 | false      |
| adc70370cdccfae7 | started | harvester-node-0-db70dc2f | https://192.168.0.30:2380 | https://192.168.0.30:2379 | false      |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
4. Add harvester-node-2 back, and check that its status is unstarted (this needs to be run on harvester-node-1, the same node the remove was executed from)
// add harvester-node-2 back; you need the node name and peer addr
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member add harvester-node-2-a20577c5 --peer-urls=https://192.168.0.32:2380"
Member 2aa8330fb817c14d added to cluster 6f68b64b4b6d89ef
ETCD_NAME="harvester-node-2-a20577c5"
ETCD_INITIAL_CLUSTER="harvester-node-2-a20577c5=https://192.168.0.32:2380,harvester-node-1-90e941aa=https://192.168.0.31:2380,harvester-node-0-db70dc2f=https://192.168.0.30:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.32:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
// check the cluster status, harvester-node-2 should be added back but not started
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list -w table"
+------------------+-----------+---------------------------+---------------------------+---------------------------+------------+
| ID               | STATUS    | NAME                      | PEER ADDRS                | CLIENT ADDRS              | IS LEARNER |
+------------------+-----------+---------------------------+---------------------------+---------------------------+------------+
| 2aa8330fb817c14d | unstarted |                           | https://192.168.0.32:2380 |                           | false      |
| 2fb80ba72cbc8dbf | started   | harvester-node-1-90e941aa | https://192.168.0.31:2380 | https://192.168.0.31:2379 | false      |
| adc70370cdccfae7 | started   | harvester-node-0-db70dc2f | https://192.168.0.30:2380 | https://192.168.0.30:2379 | false      |
+------------------+-----------+---------------------------+---------------------------+---------------------------+------------+
5. Go to harvester-node-2 and start rke2-server
$ systemctl start rke2-server
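(While it comes back, the service log shows whether the member rejoins cleanly; standard systemd tooling:)
$ journalctl -u rke2-server -f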
6. It should start without error; then we can check the etcd cluster again
$ /var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl member list -w table"
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
| ID               | STATUS  | NAME                      | PEER ADDRS                | CLIENT ADDRS              | IS LEARNER |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
| 2aa8330fb817c14d | started | harvester-node-2-a20577c5 | https://192.168.0.32:2380 | https://192.168.0.32:2379 | false      |
| 2fb80ba72cbc8dbf | started | harvester-node-1-90e941aa | https://192.168.0.31:2380 | https://192.168.0.31:2379 | false      |
| adc70370cdccfae7 | started | harvester-node-0-db70dc2f | https://192.168.0.30:2380 | https://192.168.0.30:2379 | false      |
+------------------+---------+---------------------------+---------------------------+---------------------------+------------+
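(Since the same long crictl/etcdctl invocation keeps coming up, a small wrapper can cut the repetition. This is a sketch with a hypothetical name, assuming the same RKE2 cert paths and etcd container label as above:)
etcdctl_rke2() {
  local c
  c=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
  /var/lib/rancher/rke2/bin/crictl exec "$c" sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl $*"
}
// usage: etcdctl_rke2 member list -w table, or etcdctl_rke2 endpoint health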
jolly-hospital-5285
11/27/2024, 5:56 AM
for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do kubectl -n kube-system exec $etcdpod -- etcdctl endpoint health; done
{"level":"warn","ts":"2024-11-27T05:54:02.146243Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000526000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"error reading server preface: read tcp 127.0.0.1:53020->127.0.0.1:2379: read: connection reset by peer\""}
127.0.0.1:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
command terminated with exit code 1
Error from server: error dialing backend: proxy error from 127.0.0.1:9345 while dialing 192.168.1.190:10250, code 502: 502 Bad Gateway
{"level":"warn","ts":"2024-11-27T05:54:07.99899Z","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003ee000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"error reading server preface: EOF\""}
127.0.0.1:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
command terminated with exit code 1
I'm pretty sure two etcd nodes are running fine, the primary just complains about not being able to reach the third node... But the first command fails anyway.
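(One likely reason that loop fails on every node: it drops the ETCDCTL_* endpoint and cert variables used in the earlier version, so etcdctl talks plaintext to a TLS-only endpoint; "error reading server preface" / connection reset is the usual symptom of that. The fully specified loop from the recovery steps above should behave better:)
$ for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do kubectl -n kube-system exec $etcdpod -- sh -c "ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' ETCDCTL_API=3 etcdctl endpoint health"; done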
jolly-hospital-5285
11/27/2024, 7:52 AM
quo:~ # kubectl -n kube-system exec etcd-quo -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key member list -w table
+------------------+---------+--------------+----------------------------+----------------------------+------------+
| ID               | STATUS  | NAME         | PEER ADDRS                 | CLIENT ADDRS               | IS LEARNER |
+------------------+---------+--------------+----------------------------+----------------------------+------------+
| 1b8b6a40e1cfc42b | started | quo-ff1cd5e5 | https://192.168.1.192:2380 | https://192.168.1.192:2379 | false      |
| 8d7ce6d289374fa9 | started | qua-0cae4173 | https://192.168.1.193:2380 | https://192.168.1.193:2379 | false      |
| caaab7311c583503 | started | qui-316cad6d | https://192.168.1.190:2380 | https://192.168.1.190:2379 | false      |
+------------------+---------+--------------+----------------------------+----------------------------+------------+
jolly-hospital-5285
11/27/2024, 8:30 AM
Nov 27 08:29:21 qui rke2[23728]: time="2024-11-27T08:29:21Z" level=fatal msg="starting kubernetes: preparing server: start managed database: open /var/lib/rancher/rke2/server/db/etcd/config: operation not permitted"
jolly-hospital-5285
11/27/2024, 8:33 AM
/var/lib/rancher/rke2/agent/pod-manifests/etcd.yaml:5: etcd.k3s.io/initial: '{"initial-advertise-peer-urls":"https://192.168.1.190:2380","initial-cluster":"qui-316cad6d=https://192.168.1.190:2380","initial-cluster-state":"new"}'
It needs to be like this:
initial-cluster: quo-ff1cd5e5=https://192.168.1.192:2380,qui-316cad6d=https://192.168.1.190:2380,qua-0cae4173=https://192.168.1.193:2380
initial-cluster-state: existing
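(Putting those two corrections together with the existing initial-advertise-peer-urls value, the full annotation would presumably end up looking like this; a sketch assembled from the values above, not taken from a live manifest:)
etcd.k3s.io/initial: '{"initial-advertise-peer-urls":"https://192.168.1.190:2380","initial-cluster":"quo-ff1cd5e5=https://192.168.1.192:2380,qui-316cad6d=https://192.168.1.190:2380,qua-0cae4173=https://192.168.1.193:2380","initial-cluster-state":"existing"}'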
jolly-hospital-5285
11/27/2024, 8:35 AM
etcd.k3s.io/initial: '{}'
I'll go with that and hope the config file doesn't get rewritten.
jolly-hospital-5285
11/27/2024, 8:46 AM
while (true); do cat config.bkp > config; sleep 0.1; done &