most-sunset-36476
07/19/2023, 3:43 PM
We run rke2 server --cluster-reset --cluster-reset-restore-path=<SNAPSHOT-PATH> on one of the master nodes to restore an etcd snapshot, and get errors like this during the cluster reset:
INFO[0012] Starting etcd for new cluster
INFO[0012] Tunnel server egress proxy mode: agent
INFO[0012] Tunnel server egress proxy waiting for runtime core to become available
INFO[0012] Server node token is available at /var/lib/rancher/rke2/server/token
INFO[0012] Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: connection refused"
INFO[0012] To join server node to cluster: rke2 server -s https://172.16.187.56:9345 -t ${SERVER_NODE_TOKEN}
INFO[0012] Agent node token is available at /var/lib/rancher/rke2/server/agent-token
INFO[0012] To join agent node to cluster: rke2 agent -s https://172.16.187.56:9345 -t ${AGENT_NODE_TOKEN}
INFO[0012] Wrote kubeconfig /etc/rancher/rke2/rke2.yaml
INFO[0012] Run: rke2 kubectl
INFO[0012] Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [172.16.187.55:9345]
ERRO[0012] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:44018->127.0.0.1:6444: read: connection reset by peer
ERRO[0014] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:44036->127.0.0.1:6444: read: connection reset by peer
ERRO[0016] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:44058->127.0.0.1:6444: read: connection reset by peer
INFO[0017] Tunnel server egress proxy waiting for runtime core to become available
{"level":"warn","ts":"2023-07-19T16:49:55.067+0200","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00086ea80/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0017] Failed to get apiserver address from etcd: context deadline exceeded
ERRO[0018] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:44072->127.0.0.1:6444: read: connection reset by peer
ERRO[0020] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:44078->127.0.0.1:6444: read: connection reset by peer
INFO[0022] Tunnel server egress proxy waiting for runtime core to become available
{"level":"warn","ts":"2023-07-19T16:50:00.068+0200","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00086ec40/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0022] Failed to get apiserver address from etcd: context deadline exceeded
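While the reset sits in this state, the containerd side can be inspected directly; a small sketch, assuming the default RKE2 data-dir (adjust the paths if a custom data-dir is configured):
# containerd is started by the rke2 process and logs to its own file
tail -f /var/lib/rancher/rke2/agent/containerd/containerd.log
# check whether the CRI socket the error complains about actually exists
ls -l /run/k3s/containerd/containerd.sock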
However, when we did the same thing to recover the Rancher cluster, it worked just fine with the exact same process.
1. Take an etcd backup
To take a one-time backup in an RKE2 cluster, run the following command:
rke2 etcd-snapshot save
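The snapshot ends up under the server's data directory; a quick sketch for naming and locating it (default data-dir assumed, the --name value is just an example):
# optionally give the snapshot a recognizable name instead of the default "on-demand" prefix
rke2 etcd-snapshot save --name pre-restore
# with the default data-dir, snapshots land here
ls -l /var/lib/rancher/rke2/server/db/snapshots/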
2. Stop rke2-server on all master nodes
systemctl stop rke2-server
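A minimal sketch for doing this across all masters over SSH (host names are placeholders; run the commands locally on each node if SSH is not set up):
for node in master-1 master-2 master-3; do   # placeholder host names
  ssh "$node" 'systemctl stop rke2-server; systemctl is-active rke2-server'
done
# every node should print "inactive" before moving on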
3. Reset the cluster and restore the etcd database
From one of the master nodes, execute the following command to reset the cluster and restore the etcd database.
rke2 server --cluster-reset --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>
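For illustration, with a concrete snapshot file under the default data-dir (the file name here is hypothetical; use the one produced in step 1):
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/on-demand-master-1-1689775400
# rke2 logs a message once the etcd membership has been reset and the restore is done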
4. Restart the new etcd cluster
Once the restore is complete, start the new etcd cluster:
systemctl start rke2-server
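To confirm the node came back, the bundled kubectl and the default kubeconfig can be used, for example:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes
# the restored master should report Ready once the apiserver is up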
5. Remove etcd data from other master nodes
SSH to the remaining master nodes in the cluster and execute the following command to remove the etcd data stored on the node.
rm -rf /var/lib/rancher/rke2/server/db
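A sketch of the same thing over SSH (host names are placeholders; rke2-server should still be stopped on these nodes from step 2):
for node in master-2 master-3; do   # the masters NOT used for the reset
  ssh "$node" 'rm -rf /var/lib/rancher/rke2/server/db'
done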
6. Restart rke2-server to rejoin the cluster
This will trigger a new etcd member to join the etcd cluster and sync the data from the bootstrap node.
systemctl start rke2-server
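After starting the service on each node, membership can be checked from the bootstrap node, for example:
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes -o wide
# all masters should rejoin and report Ready once etcd has synced from the bootstrap node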
Any idea what could be the cause?
Posted in #rke2