Hi all, I have an issue with backup and recovery on RKE2. The clusters are on-prem, all VMs are RHEL 8 hosted on vSphere. I tried a test recovery on a healthy downstream cluster with a fresh local snapshot taken just for the test: I stopped all master nodes and ran
rke2 server --cluster-reset --cluster-reset-restore-path=<SNAPSHOT-PATH>
on one of the master nodes, and I get errors like this during the cluster-reset:
INFO[0012] Starting etcd for new cluster
INFO[0012] Tunnel server egress proxy mode: agent
INFO[0012] Tunnel server egress proxy waiting for runtime core to become available
INFO[0012] Server node token is available at /var/lib/rancher/rke2/server/token
INFO[0012] Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: connection refused"
INFO[0012] To join server node to cluster: rke2 server -s https://172.16.187.56:9345 -t ${SERVER_NODE_TOKEN}
INFO[0012] Agent node token is available at /var/lib/rancher/rke2/server/agent-token
INFO[0012] To join agent node to cluster: rke2 agent -s https://172.16.187.56:9345 -t ${AGENT_NODE_TOKEN}
INFO[0012] Wrote kubeconfig /etc/rancher/rke2/rke2.yaml
INFO[0012] Run: rke2 kubectl
INFO[0012] Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [172.16.187.55:9345]
ERRO[0012] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:44018->127.0.0.1:6444: read: connection reset by peer
ERRO[0014] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:44036->127.0.0.1:6444: read: connection reset by peer
ERRO[0016] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:44058->127.0.0.1:6444: read: connection reset by peer
INFO[0017] Tunnel server egress proxy waiting for runtime core to become available
{"level":"warn","ts":"2023-07-19T16:49:55.067+0200","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00086ea80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0017] Failed to get apiserver address from etcd: context deadline exceeded
ERRO[0018] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:44072->127.0.0.1:6444: read: connection reset by peer
ERRO[0020] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:44078->127.0.0.1:6444: read: connection reset by peer
INFO[0022] Tunnel server egress proxy waiting for runtime core to become available
{"level":"warn","ts":"2023-07-19T16:50:00.068+0200","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00086ec40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0022] Failed to get apiserver address from etcd: context deadline exceeded
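In case it helps with diagnosing: while the cluster-reset loops on those errors, these are the basic checks I'd run on that node to see whether containerd and the supervisor port on 6444 ever come up (ports and socket path are taken from the log above; the containerd log location assumes the default RKE2 layout):
# is anything listening on the supervisor / etcd ports?
ss -tlnp | grep -E ':(6444|9345|2379)'
# did containerd create its socket, and what does its log say?
ls -l /run/k3s/containerd/containerd.sock
tail -n 50 /var/lib/rancher/rke2/agent/containerd/containerd.log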
BUT, when doing the same for the recovery of the Rancher cluster, it worked just fine with the exact same process:
1. Take an etcd backup. To take a one-time backup in an RKE2 cluster, run the following command:
rke2 etcd-snapshot save
2. Stop rke2-server on all master nodes
systemctl stop rke2-server
3. Reset the cluster and restore the etcd database. From one of the master nodes, run the following command (a concrete example is sketched after the steps below):
rke2 server --cluster-reset --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>
4. Restart the new etcd cluster. Once the restore is complete, start the new etcd cluster:
systemctl start rke2-server
5. Remove etcd data from other master nodes. SSH to the remaining master nodes in the cluster and execute the following command to remove the etcd data stored on the node:
rm -rf /var/lib/rancher/rke2/server/db
6. Restart rke2-server to rejoin the cluster. This triggers a new etcd member to join the etcd cluster and sync the data from the bootstrap node (a quick verification sketch also follows below):
systemctl start rke2-server
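To make step 3 concrete: the snapshot path below is only an illustrative example; local on-demand snapshots end up under /var/lib/rancher/rke2/server/db/snapshots/ by default, and the actual file name will differ.
# on the master chosen as the new bootstrap node, after rke2-server is stopped everywhere (step 2)
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/on-demand-master1-1689778000
# then start rke2-server on the bootstrap node (step 4), and only afterwards, on each remaining master (steps 5-6):
rm -rf /var/lib/rancher/rke2/server/db
systemctl start rke2-server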
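And once every master is back, a quick sanity check that all nodes rejoined and etcd synced from the bootstrap node (using the rke2 kubectl wrapper mentioned in the log above):
# all masters should report Ready; recent etcd log lines should show members joining
rke2 kubectl get nodes
journalctl -u rke2-server --since "15 min ago" | grep -i etcd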
Any idea what could be the cause? Posted in #rke2