# rke2
e
Hello, has anyone had issues restoring an RKE2 cluster managed by Rancher after losing all control-plane nodes? I am following this doc: https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/backup-restore-and-dis[…]tore-rancher-launched-kubernetes-clusters-from-backup
1. I have removed all CP machines.
2. I have executed the join command on a new node with all the roles selected, and the new node gets stuck with rancher-system-agent.service not being able to bootstrap the node. Here is what I see in the rancher-system-agent logs:
```
May 15 17:39:55 ip-172-23-107-94 rancher-system-agent[1291]: W0515 17:39:55.025335    1291 reflector.go:492] pkg/mod/k8s.io/client-go@v0.32.2/tools/cache/reflector.go:251: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 13; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

May 15 17:41:56 ip-172-23-107-94 rancher-system-agent[1291]: W0515 17:41:56.588541    1291 reflector.go:492] pkg/mod/k8s.io/client-go@v0.32.2/tools/cache/reflector.go:251: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 17; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

May 15 17:43:57 ip-172-23-107-94 rancher-system-agent[1291]: W0515 17:43:57.409457    1291 reflector.go:492] pkg/mod/k8s.io/client-go@v0.32.2/tools/cache/reflector.go:251: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 21; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
```
So rke2 never gets provisioned there, so I can't complete the recovery process. Any ideas much appreciated 🙏
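(For anyone hitting the same thing, a minimal way to watch what the agent is doing on the stuck node; these are the standard unit names for a custom RKE2 node, adjust if yours differ:)
```bash
# On the stuck node: check the agent unit and follow its logs
systemctl status rancher-system-agent.service
journalctl -u rancher-system-agent -f

# If/once the agent gets far enough to install rke2, its unit is worth watching too
systemctl status rke2-server.service
journalctl -u rke2-server -f
```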
c
look for errors in the logs on the Rancher cluster
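(For example, assuming kubectl access to the Rancher local cluster and the default cattle-system namespace, something like:)
```bash
# Tail the Rancher server pods in the local cluster
kubectl -n cattle-system logs -l app=rancher --tail=200 -f
```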
e
I have enabled debug logs on system agent:
```
May 15 17:58:03 ip-172-23-107-94 rancher-system-agent[1470]: time="2025-05-15T17:58:03Z" level=debug msg="[K8s] Processing secret custom-6a69b19510e2-machine-plan in namespace fleet-default at generation 0 with resource version 173187"

May 15 17:58:08 ip-172-23-107-94 rancher-system-agent[1470]: time="2025-05-15T17:58:08Z" level=debug msg="[K8s] Processing secret custom-6a69b19510e2-machine-plan in namespace fleet-default at generation 0 with resource version 173187"

May 15 17:58:13 ip-172-23-107-94 rancher-system-agent[1470]: time="2025-05-15T17:58:13Z" level=debug msg="[K8s] Processing secret custom-6a69b19510e2-machine-plan in namespace fleet-default at generation 0 with resource version 173187"

May 15 17:58:18 ip-172-23-107-94 rancher-system-agent[1470]: W0515 17:58:18.273269    1470 reflector.go:492] pkg/mod/k8s.io/client-go@v0.32.2/tools/cache/reflector.go:251: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 5; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
```
At least I see it is able to find the machine plan secret `custom-6a69b19510e2-machine-plan`
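(Side note: that plan secret can also be inspected directly from the Rancher local cluster; a rough sketch, assuming kubectl access there. An empty or missing `plan` field would mean Rancher has not scheduled anything for the node yet:)
```bash
# Inspect the machine plan secret the agent is waiting on
kubectl -n fleet-default get secret custom-6a69b19510e2-machine-plan -o yaml

# Decode the current plan, if one has been written
kubectl -n fleet-default get secret custom-6a69b19510e2-machine-plan \
  -o jsonpath='{.data.plan}' | base64 -d
```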
checking rancher logs...
on rancher side:
```
2025/05/15 17:59:59 [DEBUG] Searching for providerID for selector rke.cattle.io/machine=75984c19-6ba4-4d23-8bb9-c44141054d9a in cluster fleet-default/dev-euc1-te-test06, machine custom-6a69b19510e2: an error on the server ("error trying to reach service: cluster agent disconnected") has prevented the request from succeeding (get nodes)
2025/05/15 17:59:59 [DEBUG] DesiredSet - No change(2) /v1, Kind=ServiceAccount fleet-default/custom-6a69b19510e2-machine-bootstrap for rke-bootstrap fleet-default/custom-6a69b19510e2
```
This happens at the same time as the error on the system agent,
but that does not tell me much 😕 Is Rancher trying to connect to the system agent on that new node?
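(Another thing that can be checked on the Rancher local cluster at this point is the CAPI machine object for that node; a sketch, assuming kubectl access:)
```bash
# List the machines tracked for this cluster and their phases
kubectl -n fleet-default get machines.cluster.x-k8s.io

# Conditions on the stuck machine usually say what it is waiting for
kubectl -n fleet-default describe machine.cluster.x-k8s.io custom-6a69b19510e2
```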
```
2025/05/15 18:06:23 [INFO] [planner] rkecluster fleet-default/dev-euc1-te-test06: rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.
```
interesting
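(That planner message is also recorded as a condition on the control-plane object, which can be read directly; a sketch, assuming kubectl access to the Rancher local cluster and the cluster name from the logs above:)
```bash
# The planner writes its state into the rkecontrolplane conditions
kubectl -n fleet-default get rkecontrolplanes.rke.cattle.io dev-euc1-te-test06 -o yaml
```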
c
that's what you're trying to do, right? If you lose all your etcd nodes then you need to restore from a snapshot. Otherwise all your data is lost.
e
yeah exactly
c
did you start the snapshot restore process yet?
e
I can't. I have removed all CP machines, as the doc says the second step is to get a new CP machine to run the restore on,
and that new CP node can't bootstrap because the system agent got stuck.
Without any CP machine, the restore button is not even displayed.
c
check the apiserver logs on the rancher cluster. according to the rancher-system-agent log you shared, it can't watch the plan secret due to some error from the apiserver.
is your rancher cluster itself having problems?
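(Roughly what that apiserver check could look like; how you reach the apiserver logs depends on how the Rancher local cluster is run, and the sketch below assumes it uses static pods labelled the usual way in kube-system:)
```bash
# Look for the server side of those INTERNAL_ERROR stream resets
kubectl -n kube-system logs -l component=kube-apiserver --tail=500 \
  | grep -i -e "internal_error" -e "stream error"
```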
e
nope, Rancher is fine, its local cluster is fine, and other managed clusters are also fine
c
are there any other messages in the rancher-system-agent logs?
what did it do before and after that?
e
nope, it is just stuck and prints only that one log line
c
It kinda looks like you have something in between the node and rancher that is closing the watch connection every 2 minutes.
but that wouldn’t explain why there’s nothing in the plan secret for it
check other pod logs on the rancher side for messages regarding that machine
you saw a message in the logs that says restore is required, are you not seeing that same thing in the UI?
e
I did check all Rancher pod logs; will now also check the webhook and upgrade controller, and I'm also checking the API server logs and audit logs...
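(Roughly what that looks like, assuming the default labels; the grep target is just the machine name from the earlier logs:)
```bash
# Other provisioning-related pods on the Rancher side, grepped for the stuck machine
kubectl -n cattle-system logs -l app=rancher --tail=2000 | grep custom-6a69b19510e2
kubectl -n cattle-system logs -l app=rancher-webhook --tail=500 | grep custom-6a69b19510e2
```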
c
Did you assign the new node the correct roles?
you are reading the rke2/k3s steps there right? Not the rke steps?
1. Remove all etcd nodes from your cluster.
   a. In the upper left corner, click ☰ > Cluster Management.
   b. In the Clusters page, go to the cluster where you want to remove nodes.
   c. In the Machines tab, click ⋮ > Delete on each node you want to delete. Initially, you will see the nodes hang in a `deleting` state, but once all etcd nodes are deleting, they will be removed together. This is due to the fact that Rancher sees all etcd nodes deleting and proceeds to “short circuit” the etcd safe-removal logic.
2. After all etcd nodes are removed, add the new etcd node that you are planning to restore from. Assign the new node the role of `all` (etcd, controlplane, and worker).
   ◦ If the node was previously in a cluster, clean the node first.
   ◦ For custom clusters, go to the Registration tab and check the box for `etcd, controlplane, and worker`. Then copy and run the registration command on your node.
   ◦ For node driver clusters, a new node is provisioned automatically.
3. At this point, Rancher will indicate that restoration from etcd snapshot is required.
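(On the "clean the node first" note in step 2: for a custom node that was previously registered, the usual cleanup is to run the uninstall scripts that rke2 and rancher-system-agent drop on the host; a hedged sketch, assuming the default install paths:)
```bash
# On the node being reused, before running the new registration command
/usr/local/bin/rancher-system-agent-uninstall.sh   # if present from an earlier registration
/usr/local/bin/rke2-uninstall.sh                   # if rke2 was installed before
```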
e
> you saw a message in the logs that says restore is required, are you not seeing that same thing in the UI?
both
> Did you assign the new node the correct roles?
yes, all 4 as per documentation
> you are reading the rke2/k3s steps there right? Not the rke steps?
yes, this is exactly what I am doing
I did step 1, and on step 2 the new node got stuck
c
ok so what happens if you go into the snapshots list in the UI?
Yes, it will. That is where you go into snapshots and pick the one to restore.
As that page says:
> At this point, Rancher will indicate that restoration from etcd snapshot is required
e
image.png
ok, I have tried to click on one of the snapshots to restore
c
yes. it looks like you’re clicking on the cluster instead of the snapshot as instructed
> Go to the snapshot you want to restore and click ⋮ > Restore.
If you click on the wrong thing you won’t see the restore options
e
I tried that but nothing seems to change
c
so you picked a snapshot and clicked through to start the restore?
e
I mean I tried to click restore now
image.png
yes, but the system agent on that node is still sending
```
May 15 18:39:41 ip-172-23-107-94 rancher-system-agent[1470]: W0515 18:39:41.992571    1470 reflector.go:492] pkg/mod/k8s.io/client-go@v0.32.2/tools/cache/reflector.go:251: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 97; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
```
c
did you finish the rest of the restore dialog?
e
yes, selected to restore everything
image.png
c
try doing just etcd
e
oh that worked
only etcd I mean
c
there ya go
e
uff, thanks a ton!
hopefully it will reconcile and join all the workers to the new server
but that is another story, it is already huge progress for me. Thank you very much! 🍻
s
If your issue is that the control plane machine is taking a long time to join, one solution I found is to just register one worker node to the cluster, and I observed the whole cluster coming up. I don't know the real secret behind it. 🙂
c
The cluster needs machines for all roles. If you have only created a single machine with etcd+control-plane, it will wait for a worker to be created before proceeding. I believe the UI even tells you this…
e
My issue was that I was trying to restore with Cluster config (option 3), which never actually triggered the recovery procedure on the pending node; it worked only with option 1 "Only ETCD" and option 2 "k8s version and ETCD".
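(For reference, what the UI "Restore" dialog sets can be seen on the provisioning cluster object in the Rancher local cluster; a rough sketch, using the cluster name from this thread. `restoreRKEConfig` maps roughly to the three options: `none` = only etcd, `kubernetesVersion` = K8s version and etcd, `all` = cluster config as well:)
```bash
# Inspect what the last restore request asked for
kubectl -n fleet-default get clusters.provisioning.cattle.io dev-euc1-te-test06 \
  -o jsonpath='{.spec.rkeConfig.etcdSnapshotRestore}'
```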
I also had many other issues with Kyverno and Cilium once the recovery procedure started, but those were easy to fix.