Struggle to restore an HA RKE2 cluster (3 server no...
# general
Hello, we had a planned downtime in our datacenter and had to bring down an entire RKE2 cluster (rke2 version v1.25.12+rke2r1). Now we are not able to bring it back up. We followed the snapshot restore procedure documented here, but that didn't seem to help either: https://docs.rke2.io/backup_restore#restoring-a-snapshot-to-existing-nodes

Can someone clarify whether an etcd instance should still be running after step 2? In our case it is, and because of that, step 3 wouldn't complete, complaining that port 2380 is already in use. Shouldn't the rke2 server be spinning up a new etcd?

Assuming the etcd snapshot was restored at that point, if I run rke2-killall.sh and then try to start rke2-server again, that doesn't work either: etcd refuses to start up. And very oddly, there isn't anything in the etcd container logs to indicate why it exited.
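To make it concrete, this is roughly how we understand the linked procedure; the snapshot name below is a placeholder, and the rke2-killall.sh calls are our own addition to make sure nothing is left holding port 2380:

```bash
# On every server node: stop the rke2-server service. rke2-killall.sh also
# kills the static-pod containers (etcd, kube-apiserver, ...) that stopping
# the service alone can leave running.
systemctl stop rke2-server
rke2-killall.sh

# On the first server node only: reset the cluster to a single etcd member
# and restore from an on-disk snapshot (placeholder name), then start rke2.
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>
systemctl start rke2-server

# On the remaining two server nodes: wipe the old etcd data and rejoin.
rm -rf /var/lib/rancher/rke2/server/db
systemctl start rke2-server
```

The last lines the etcd container prints before it exits are: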
```
{"level":"info","ts":"2023-11-08T20:46:59.845Z","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"b873ea0e9c7657b","local-member-id":"104e1c3b54d5529d","recovered-remote-peer-id":"104e1c3b54d5529d","recovered-remote-peer-urls":["https://10.188.32.12:2380"]}
{"level":"warn","ts":"2023-11-08T20:46:59.846Z","caller":"auth/store.go:1234","msg":"simple token is not cryptographically signed"}
{"level":"info","ts":"2023-11-08T20:46:59.846Z","caller":"mvcc/kvstore.go:323","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":55136791}
{"level":"info","ts":"2023-11-08T20:46:59.904Z","caller":"mvcc/kvstore.go:393","msg":"kvstore restored","current-rev":55146615}
{"level":"info","ts":"2023-11-08T20:46:59.904Z","caller":"etcdserver/quota.go:94","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
{"level":"info","ts":"2023-11-08T20:46:59.907Z","caller":"etcdserver/corrupt.go:95","msg":"starting initial corruption check","local-member-id":"104e1c3b54d5529d","timeout":"15s"}
{"level":"info","ts":"2023-11-08T20:46:59.923Z","caller":"etcdserver/corrupt.go:165","msg":"initial corruption checking passed; no corruption","local-member-id":"104e1c3b54d5529d"}
{"level":"info","ts":"2023-11-08T20:46:59.923Z","caller":"etcdserver/server.go:854","msg":"starting etcd server","local-member-id":"104e1c3b54d5529d","local-server-version":"3.5.7","cluster-version":"to_be_decided"}
{"level":"info","ts":"2023-11-08T20:46:59.923Z","caller":"etcdserver/server.go:738","msg":"started as single-node; fast-forwarding election ticks","local-member-id":"104e1c3b54d5529d","forward-ticks":9,"forward-duration":"4.5s","election-ticks":10,"election-timeout":"5s"}
{"level":"info","ts":"2023-11-08T20:46:59.923Z","caller":"fileutil/purge.go:44","msg":"started to purge file","dir":"/var/lib/rancher/rke2/server/db/etcd/member/snap","suffix":"snap.db","max":5,"interval":"30s"}
{"level":"info","ts":"2023-11-08T20:46:59.923Z","caller":"fileutil/purge.go:44","msg":"started to purge file","dir":"/var/lib/rancher/rke2/server/db/etcd/member/snap","suffix":"snap","max":5,"interval":"30s"}
{"level":"info","ts":"2023-11-08T20:46:59.923Z","caller":"fileutil/purge.go:44","msg":"started to purge file","dir":"/var/lib/rancher/rke2/server/db/etcd/member/wal","suffix":"wal","max":5,"interval":"30s"}
{"level":"info","ts":"2023-11-08T20:46:59.924Z","caller":"membership/cluster.go:584","msg":"set initial cluster version","cluster-id":"b873ea0e9c7657b","local-member-id":"104e1c3b54d5529d","cluster-version":"3.5"}
{"level":"info","ts":"2023-11-08T20:46:59.924Z","caller":"api/capability.go:75","msg":"enabled capabilities for version","cluster-version":"3.5"}
{"level":"info","ts":"2023-11-08T20:46:59.927Z","caller":"embed/etcd.go:687","msg":"starting with client TLS","tls-info":"cert = /var/lib/rancher/rke2/server/tls/etcd/server-client.crt, key = /var/lib/rancher/rke2/server/tls/etcd/server-client.key, client-cert=, client-key=, trusted-ca = /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2023-11-08T20:46:59.927Z","caller":"embed/etcd.go:586","msg":"serving peer traffic","address":"10.188.32.12:2380"}
{"level":"info","ts":"2023-11-08T20:46:59.927Z","caller":"embed/etcd.go:558","msg":"cmux::serve","address":"10.188.32.12:2380"}
{"level":"info","ts":"2023-11-08T20:46:59.927Z","caller":"embed/etcd.go:586","msg":"serving peer traffic","address":"127.0.0.1:2380"}
{"level":"info","ts":"2023-11-08T20:46:59.927Z","caller":"embed/etcd.go:558","msg":"cmux::serve","address":"127.0.0.1:2380"}
{"level":"info","ts":"2023-11-08T20:46:59.927Z","caller":"embed/etcd.go:275","msg":"now serving peer/client/metrics","local-member-id":"104e1c3b54d5529d","initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["https://10.188.32.12:2380","https://127.0.0.1:2380"],"advertise-client-urls":["https://10.188.32.12:2379"],"listen-client-urls":["https://10.188.32.12:2379","https://127.0.0.1:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"]}
{"level":"info","ts":"2023-11-08T20:46:59.928Z","caller":"embed/etcd.go:762","msg":"serving metrics","address":"http://127.0.0.1:2381"}
Any pointers on how to go about restoring our cluster?