# rke2
e
Hello, has anyone had issues restoring an RKE2 cluster managed by Rancher after losing all control-plane nodes? I am following this doc: https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/backup-restore-and-dis[…]tore-rancher-launched-kubernetes-clusters-from-backup
1. I have removed all CP machines.
2. I have executed the join command on a new node with all the roles selected, and the new node gets stuck with rancher-system-agent.service not being able to bootstrap the node. Here is what I see in the rancher-system-agent logs:
```
May 15 17:39:55 ip-172-23-107-94 rancher-system-agent[1291]: W0515 17:39:55.025335    1291 reflector.go:492] pkg/mod/k8s.io/client-go@v0.32.2/tools/cache/reflector.go:251: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 13; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

May 15 17:41:56 ip-172-23-107-94 rancher-system-agent[1291]: W0515 17:41:56.588541    1291 reflector.go:492] pkg/mod/k8s.io/client-go@v0.32.2/tools/cache/reflector.go:251: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 17; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

May 15 17:43:57 ip-172-23-107-94 rancher-system-agent[1291]: W0515 17:43:57.409457    1291 reflector.go:492] pkg/mod/k8s.io/client-go@v0.32.2/tools/cache/reflector.go:251: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 21; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
```
So rke2 never gets provisioned there, so I can't complete the recovery process. Any ideas much appreciated 🙏
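(For anyone hitting the same thing, a minimal way to watch what the agent is doing on the stuck node; these are the standard unit names for a custom RKE2 node, adjust if yours differ:)
```bash
# On the stuck node: check the agent unit and follow its logs
systemctl status rancher-system-agent.service
journalctl -u rancher-system-agent -f

# If/once the agent gets far enough to install rke2, its unit is worth watching too
systemctl status rke2-server.service
journalctl -u rke2-server -f
```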
c
look for errors in the logs on the Rancher cluster
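(For example, assuming kubectl access to the Rancher local cluster and the default cattle-system namespace, something like:)
```bash
# Tail the Rancher server pods in the local cluster
kubectl -n cattle-system logs -l app=rancher --tail=200 -f
```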
e
I have enabled debug logs on system agent:
```
May 15 17:58:03 ip-172-23-107-94 rancher-system-agent[1470]: time="2025-05-15T17:58:03Z" level=debug msg="[K8s] Processing secret custom-6a69b19510e2-machine-plan in namespace fleet-default at generation 0 with resource version 173187"

May 15 17:58:08 ip-172-23-107-94 rancher-system-agent[1470]: time="2025-05-15T17:58:08Z" level=debug msg="[K8s] Processing secret custom-6a69b19510e2-machine-plan in namespace fleet-default at generation 0 with resource version 173187"

May 15 17:58:13 ip-172-23-107-94 rancher-system-agent[1470]: time="2025-05-15T17:58:13Z" level=debug msg="[K8s] Processing secret custom-6a69b19510e2-machine-plan in namespace fleet-default at generation 0 with resource version 173187"

May 15 17:58:18 ip-172-23-107-94 rancher-system-agent[1470]: W0515 17:58:18.273269    1470 reflector.go:492] pkg/mod/k8s.io/client-go@v0.32.2/tools/cache/reflector.go:251: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 5; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
```
At least I see it is able to find the machine plan secret `custom-6a69b19510e2-machine-plan`
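(Side note: that plan secret can also be inspected directly from the Rancher local cluster; a rough sketch, assuming kubectl access there. An empty or missing `plan` field would mean Rancher has not scheduled anything for the node yet:)
```bash
# Inspect the machine plan secret the agent is waiting on
kubectl -n fleet-default get secret custom-6a69b19510e2-machine-plan -o yaml

# Decode the current plan, if one has been written
kubectl -n fleet-default get secret custom-6a69b19510e2-machine-plan \
  -o jsonpath='{.data.plan}' | base64 -d
```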
checking rancher logs...
on rancher side:
```
2025/05/15 17:59:59 [DEBUG] Searching for providerID for selector rke.cattle.io/machine=75984c19-6ba4-4d23-8bb9-c44141054d9a in cluster fleet-default/dev-euc1-te-test06, machine custom-6a69b19510e2: an error on the server ("error trying to reach service: cluster agent disconnected") has prevented the request from succeeding (get nodes)
2025/05/15 17:59:59 [DEBUG] DesiredSet - No change(2) /v1, Kind=ServiceAccount fleet-default/custom-6a69b19510e2-machine-bootstrap for rke-bootstrap fleet-default/custom-6a69b19510e2
```
This happens at the same time as the error on the system agent,
but that does not tell me much 😕 Is Rancher trying to connect to the system agent on that new node?
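(Another thing that can be checked on the Rancher local cluster at this point is the CAPI machine object for that node; a sketch, assuming kubectl access:)
```bash
# List the machines tracked for this cluster and their phases
kubectl -n fleet-default get machines.cluster.x-k8s.io

# Conditions on the stuck machine usually say what it is waiting for
kubectl -n fleet-default describe machine.cluster.x-k8s.io custom-6a69b19510e2
```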
```
2025/05/15 18:06:23 [INFO] [planner] rkecluster fleet-default/dev-euc1-te-test06: rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.
```
interesting
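(That planner message is also recorded as a condition on the control-plane object, which can be read directly; a sketch, assuming kubectl access to the Rancher local cluster and the cluster name from the logs above:)
```bash
# The planner writes its state into the rkecontrolplane conditions
kubectl -n fleet-default get rkecontrolplanes.rke.cattle.io dev-euc1-te-test06 -o yaml
```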
c
that's what you're trying to do, right? If you lose all your etcd nodes then you need to restore from a snapshot. Otherwise all your data is lost.
e
yeah exactly
c
did you start the snapshot restore process yet?
e
I can't. I have removed all CP machines, as the doc says the second step is to get a new CP machine to run the restore on,
and that new CP node can't bootstrap because the system agent got stuck.
Without any CP machine, the restore button is not even displayed.
c
check the apiserver logs on the rancher cluster. according to the rancher-system-agent log you shared, it can't watch the plan secret due to some error from the apiserver.
is your rancher cluster itself having problems?
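(Roughly what that apiserver check could look like; how you reach the apiserver logs depends on how the Rancher local cluster is run, and the sketch below assumes it uses static pods labelled the usual way in kube-system:)
```bash
# Look for the server side of those INTERNAL_ERROR stream resets
kubectl -n kube-system logs -l component=kube-apiserver --tail=500 \
  | grep -i -e "internal_error" -e "stream error"
```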
e
nope, Rancher is fine, its local cluster is fine, and other managed clusters are also fine
c
are there any other messages in the rancher-system-agent logs?
what did it do before and after that?
e
nope, it is just stuck and prints only that one log line
c
It kinda looks like you have something in between the node and rancher that is closing the watch connection every 2 minutes.
but that wouldn’t explain why there’s nothing in the plan secret for it
check other pod logs on the rancher side for messages regarding that machine
you saw a message in the logs that says restore is required, are you not seeing that same thing in the UI?
e
I did check all Rancher pod logs; will now also check the webhook and upgrade controller, and I'm also checking the API server logs and audit logs...
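(Roughly what that looks like, assuming the default labels; the grep target is just the machine name from the earlier logs:)
```bash
# Other provisioning-related pods on the Rancher side, grepped for the stuck machine
kubectl -n cattle-system logs -l app=rancher --tail=2000 | grep custom-6a69b19510e2
kubectl -n cattle-system logs -l app=rancher-webhook --tail=500 | grep custom-6a69b19510e2
```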
c
Did you assign the new node the correct roles?
you are reading the rke2/k3s steps there right? Not the rke steps?
1. Remove all etcd nodes from your cluster.
   a. In the upper left corner, click ☰ > Cluster Management.
   b. In the Clusters page, go to the cluster where you want to remove nodes.
   c. In the Machines tab, click ⋮ > Delete on each node you want to delete. Initially, you will see the nodes hang in a `deleting` state, but once all etcd nodes are deleting, they will be removed together. This is due to the fact that Rancher sees all etcd nodes deleting and proceeds to “short circuit” the etcd safe-removal logic.
2. After all etcd nodes are removed, add the new etcd node that you are planning to restore from. Assign the new node the role of `all` (etcd, controlplane, and worker).
   ◦ If the node was previously in a cluster, clean the node first.
   ◦ For custom clusters, go to the Registration tab and check the box for `etcd, controlplane, and worker`. Then copy and run the registration command on your node.
   ◦ For node driver clusters, a new node is provisioned automatically.
3. At this point, Rancher will indicate that restoration from etcd snapshot is required.
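(On the "clean the node first" note in step 2: for a custom node that was previously registered, the usual cleanup is to run the uninstall scripts that rke2 and rancher-system-agent drop on the host; a hedged sketch, assuming the default install paths:)
```bash
# On the node being reused, before running the new registration command
/usr/local/bin/rancher-system-agent-uninstall.sh   # if present from an earlier registration
/usr/local/bin/rke2-uninstall.sh                   # if rke2 was installed before
```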
e
> you saw a message in the logs that says restore is required, are you not seeing that same thing in the UI?
both
> Did you assign the new node the correct roles?
yes, all 4 as per documentation
> you are reading the rke2/k3s steps there right? Not the rke steps?
yes, this is exactly what I am doing
I did step 1, and on step 2 the new node got stuck
c
ok so what happens if you go into the snapshots list in the UI?
Yes, it will. That is where you go into snapshots and pick the one to restore.
As that page says:
> At this point, Rancher will indicate that restoration from etcd snapshot is required
e
image.png
ok, I have tried to click on one of the snapshots to restore
c
yes. it looks like you’re clicking on the cluster instead of the snapshot as instructed
> Go to the snapshot you want to restore and click ⋮ > Restore.
If you click on the wrong thing you won’t see the restore options
e
I tried that but nothing seems to change
c
so you picked a snapshot and clicked through to start the restore?
e
I mean I tried to click restore now
image.png
yes, but the system agent on that node is still sending
```
May 15 18:39:41 ip-172-23-107-94 rancher-system-agent[1470]: W0515 18:39:41.992571    1470 reflector.go:492] pkg/mod/k8s.io/client-go@v0.32.2/tools/cache/reflector.go:251: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 97; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
```
c
did you finish the rest of the restore dialog?
e
yes, selected to restore everything
image.png
c
try doing just etcd
e
oh that worked
only etcd I mean
c
there ya go
e
uff, thanks a ton!
hopefully it will reconcile and join all the workers to the new server
but that is another story, it is already huge progress for me. Thank you very much! 🍻
s
If your issue is that the control plane machine is taking a long time to join, one solution I found is to just register one worker node to the cluster, and I observed the whole cluster coming up. I don't know the real secret behind it. 🙂
c
The cluster needs machines for all roles. If you have only created a single machine with etcd+control-plane, it will wait for a worker to be created before proceeding. I believe the UI even tells you this…
e
My issue was that I was trying to restore with Cluster config (option 3), which never actually triggered the recovery procedure on the pending node; it worked only with option 1 "Only ETCD" and option 2 "k8s version and ETCD".
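(For reference, what the UI "Restore" dialog sets can be seen on the provisioning cluster object in the Rancher local cluster; a rough sketch, using the cluster name from this thread. `restoreRKEConfig` maps roughly to the three options: `none` = only etcd, `kubernetesVersion` = K8s version and etcd, `all` = cluster config as well:)
```bash
# Inspect what the last restore request asked for
kubectl -n fleet-default get clusters.provisioning.cattle.io dev-euc1-te-test06 \
  -o jsonpath='{.spec.rkeConfig.etcdSnapshotRestore}'
```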
I also had many other issues with Kyverno and Cilium once the recovery procedure started, but those were easy to fix.