# general
b
I have been stuck for a week now. I am trying to recover my cluster using the DR guide. I am unable to add any node to my RKE2 cluster (which is managed by Rancher); the registration command gets stuck after
Generating Cattle id
My question is: Is it compulsory to have a working CP node in order to register a node in the cluster using rancher registration token?
f
I'm not entirely sure, but I have to ask -- is your RKE2 cluster running on bare metal or in a cloud provider?
Usually the first step in recovering a k8s cluster is getting your control plane healthy. If your CP isn't healthy (i.e. it doesn't have quorum), then I'm not sure what else you can do with it
I would love to be corrected, though
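If it helps, my rough first look on whatever server nodes are still around would be something like this (standard RKE2 paths assumed; I'm more of a k3s person, so double-check):
# Check whether any control-plane/etcd components still answer
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes -o wide
/var/lib/rancher/rke2/bin/kubectl -n kube-system get pods | grep -E 'etcd|kube-apiserver'
# And the rke2-server service logs on each remaining server node
journalctl -u rke2-server --since "1 hour ago" | tail -n 50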
b
thanks for your reply. The RKE2 cluster is running on bare metal. The cluster itself no longer has a CP or etcd, only worker nodes, but Rancher still shows a message in the banner.
I did a little bit of troubleshooting and found out that the
system-agent-install.sh
script, which is run to register a node to the current cluster, gets stuck at this command:
curl --connect-timeout 60 --max-time 60 --write-out '%{http_code}\n' -sS \
  -H 'Authorization: Bearer <token>' \
  -H 'X-Cattle-Id: f8bcebdca8c1dcce980ee7d67b583b5b3db64419bc3a0e130f8a1369a8a395a' \
  -H 'X-Cattle-Role-Etcd: true' \
  -H 'X-Cattle-Role-Control-Plane: true' \
  -H 'X-Cattle-Role-Worker: true' \
  -H 'X-Cattle-Node-Name: <eradicated>' \
  -H 'X-Cattle-Address: ' \
  -H 'X-Cattle-Internal-Address: <eradicated>' \
  -H 'X-Cattle-Labels: cattle.io/os=linux' \
  -H 'X-Cattle-Taints: ' \
  https://rancher.internal/v3/connect/agent \
  -o /var/lib/rancher/agent/rancher2_connection_info.json
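To rule out basic connectivity, I can also hit Rancher directly from the node, something like this (same rancher.internal URL; as far as I know /ping should just return "pong" on a healthy Rancher):
# Does Rancher answer at all from this node?
curl -sk --connect-timeout 10 https://rancher.internal/ping
curl -sk --connect-timeout 10 -o /dev/null -w '%{http_code}\n' https://rancher.internal/healthz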
f
What does it look like on the RKE2 node if you run
kubectl get nodes
?
Not entirely sure what's going on, but if the registration process requires a Pod to be running, an unhealthy CP will likely prevent that Pod from running
b
kubectl get nodes
shows nothing, as the rke2 installation did not complete and is stuck after
Generating Cattle id
[INFO]  Label: cattle.io/os=linux
[INFO]  Role requested: etcd
[INFO]  Role requested: controlplane
[INFO]  Role requested: worker
[INFO]  CA strict verification is set to false
[INFO]  Using default agent configuration directory /etc/rancher/agent
[INFO]  Using default agent var directory /var/lib/rancher/agent
[INFO]  Determined CA is not necessary to connect to Rancher
[INFO]  Successfully tested Rancher connection
[INFO]  Downloading rancher-system-agent binary from https://rancher.internal/assets/rancher-system-agent-amd64
[INFO]  Successfully downloaded the rancher-system-agent binary.
[INFO]  Downloading rancher-system-agent-uninstall.sh script from https://rancher.internal/assets/system-agent-uninstall.sh
[INFO]  Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO]  Generating Cattle ID
curl: (28) Operation timed out after 60002 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
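I guess the next thing for me to check is the Rancher side of that request, something like this (assuming the standard rancher deployment in the cattle-system namespace on the management cluster):
# Tail the Rancher server logs for the agent-connect attempts
kubectl -n cattle-system logs deploy/rancher --tail=200 | grep -i connect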
f
ohhhh gotcha; you're bootstrapping a cluster
I was mistaken then, I'm probably leading you down an incorrect path. For whatever reason I was thinking you were trying to add a new node to an existing k8s cluster... I'm not sure how bootstrapping works between Rancher & RKE2
b
I am actually trying to add a node to an existing cluster which has failed. I had to remove the etcd and control plane nodes, and all I have left now is the worker nodes. So right now I am trying to add a node to this existing cluster and I am facing the above-mentioned issue.
f
gotcha; I am really at a loss then. My experience with Rancher has largely been with importing hand-provisioned K3S clusters into Rancher. If my control-plane nodes all died and all I had were worker nodes left, I would think I'd probably be at "uninstall k3s and build a k3s cluster from scratch as if it were brand-new". A cluster without a control-plane or recoverable control-plane data isn't much of a cluster. That could change when you swap k3s with RKE2, and when you swap "I'm building the cluster using k3s directly and adding to Rancher" vs "I'm building the cluster through Rancher". Nevertheless, I've never been in this spot myself, so I'm not going to be much help. Apologies for the confusion
b
well, thank you for trying to help, much appreciated. That cluster has really important workloads for me, so I must revive it somehow. Since I was unable to register any node to the failed cluster with Rancher, I installed rke2 on a separate host and restored the etcd snapshot onto it. I can see all the cluster state intact on the new host, but that node cannot communicate with the other nodes, as I need to migrate/register the worker nodes there. Do you know if there is a way to migrate the worker nodes to that new host? In such a way that I don't lose the data of the PVs in Longhorn?
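For reference, the restore was roughly the documented RKE2 flow on the new host (the snapshot path below is a placeholder):
# Stop the server, reset the cluster from the snapshot, start it again
systemctl stop rke2-server
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>
systemctl start rke2-server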
f
#CC2UQM49Y is probably a good place to ask about the Longhorn bit. IMO, if the data is important, I'd be grabbing snapshots of the PV images before you do anything. In fact, if you can, make a backup or snapshot of the nodes, period. When you get it back up and online, I'd consider setting up Longhorn backups too.
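Before touching anything I'd at least inventory what Longhorn is holding, something like this (assumes Longhorn sits in its default longhorn-system namespace and you have a working API server, e.g. the host you restored the snapshot onto):
# List PVs and the Longhorn volume objects behind them
kubectl get pv
kubectl -n longhorn-system get volumes.longhorn.io
# The replica data itself normally lives under /var/lib/longhorn on each worker,
# so a plain file-level copy of that directory is another belt-and-braces option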
In k3s, the control node(s) or URLs are usually specified in a config file or directly in the systemd unit for k3s-agent that configures the worker nodes.
K3s config:
/etc/rancher/k3s/config.yaml
systemd unit:
systemctl cat k3s-agent
In the past, I had to recover k3s workers where 1 of the 3 control nodes failed and I had accidentally pinned the compute nodes to that first control node, so I went in and modified the "server" option to point to a new server
Example config file in /etc/rancher/k3s/config.yaml
$ cat /etc/rancher/k3s/config.yaml
node-name: compute-node-1
server: https://control-node-1:6443
token: K...::server:....
In that case, I updated the control-node-1 to be something else
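Concretely, that fix boiled down to something like this (hostnames are just examples from my setup):
# Point the agent at a surviving control node and restart it
sudo vi /etc/rancher/k3s/config.yaml      # change server: https://control-node-1:6443 to the new node
sudo systemctl restart k3s-agent
journalctl -u k3s-agent -f                # watch it reconnect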
b
I could find a similar file at the path
/etc/rancher/rke2/config.yaml.d/50-rancher.yaml
on the rke2 worker node as well.
{
  "node-label": [
    "<http://cattle.io/os=linux|cattle.io/os=linux>",
    "<http://rke.cattle.io/machine=b67b5f2c-9d28-4be1-8bfb-cb6e0768eb70|rke.cattle.io/machine=b67b5f2c-9d28-4be1-8bfb-cb6e0768eb70>"
  ],
  "private-registry": "/etc/rancher/rke2/registries.yaml",
  "server": "<https://10.0.20.165:9345>",
  "token": "<token>"
}
so if I just update the
"server": "https://10.0.20.165:9345"
address with the new control plane and restart the rke2-agent service, would it simply work?
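i.e. something along these lines, if that makes sense (the new address would be the host where I restored the snapshot):
# /etc/rancher/rke2/config.yaml.d/50-rancher.yaml, only the server line changed:
#   "server": "https://<new-server-ip>:9345",
systemctl restart rke2-agent
journalctl -u rke2-agent -f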
f
You know, I'm not sure, but assuming you have backups, that might be where I start. You might cross-post in #C01PHNP149L to be sure, since you're in the realm of recovering your RKE2 cluster and not necessarily troubleshooting Rancher itself
Like I said, I don't have much experience with rke2. It's similar to k3s but if it's important, it's probably worth double-checking
🙏 1