# general
b
I have been stuck for a week now. I am trying to recover my cluster using the DR guide. I am unable to add any node to my RKE2 cluster (which is managed by Rancher); the registration command gets stuck after
Generating Cattle id
My question is: Is it compulsory to have a working CP node in order to register a node in the cluster using rancher registration token?
f
I'm not entirely sure, but I have to ask -- is your RKE2 cluster running on bare metal or in a cloud provider?
Usually the first step in recovering a k8s cluster is getting your control plane healthy. If your CP isn't healthy (i.e. it doesn't have quorum), then I'm not sure what else you can do with it
I would love to be corrected, though
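If it helps, my rough first look on whatever server nodes are still around would be something like this (standard RKE2 paths assumed; I'm more of a k3s person, so double-check):
# Check whether any control-plane/etcd components still answer
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes -o wide
/var/lib/rancher/rke2/bin/kubectl -n kube-system get pods | grep -E 'etcd|kube-apiserver'
# And the rke2-server service logs on each remaining server node
journalctl -u rke2-server --since "1 hour ago" | tail -n 50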
b
thanks for your reply. The RKE2 cluster is running on bare metal. The cluster itself no longer has a CP or etcd, only worker nodes, but Rancher still shows a message in the banner.
I did a little bit of troubleshooting and found out that the
system-agent-install.sh
script, which is run to register a node to the current cluster, gets stuck at this command:
curl --connect-timeout 60 --max-time 60 --write-out '%{http_code}\n' -sS \
  -H 'Authorization: Bearer <token>' \
  -H 'X-Cattle-Id: f8bcebdca8c1dcce980ee7d67b583b5b3db64419bc3a0e130f8a1369a8a395a' \
  -H 'X-Cattle-Role-Etcd: true' \
  -H 'X-Cattle-Role-Control-Plane: true' \
  -H 'X-Cattle-Role-Worker: true' \
  -H 'X-Cattle-Node-Name: <eradicated>' \
  -H 'X-Cattle-Address: ' \
  -H 'X-Cattle-Internal-Address: <eradicated>' \
  -H 'X-Cattle-Labels: cattle.io/os=linux' \
  -H 'X-Cattle-Taints: ' \
  https://rancher.internal/v3/connect/agent \
  -o /var/lib/rancher/agent/rancher2_connection_info.json
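To rule out basic connectivity, I can also hit Rancher directly from the node, something like this (same rancher.internal URL; as far as I know /ping should just return "pong" on a healthy Rancher):
# Does Rancher answer at all from this node?
curl -sk --connect-timeout 10 https://rancher.internal/ping
curl -sk --connect-timeout 10 -o /dev/null -w '%{http_code}\n' https://rancher.internal/healthz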
f
What does it look like on the RKE2 node if you run
kubectl get nodes
?
Not entirely sure what's going on, but if the registration process requires a Pod to be running, an unhealthy CP will likely prevent that Pod from running
b
kubectl get nodes
shows nothing, as the rke2 installation did not complete and is stuck after
Generating Cattle id
[INFO]  Label: cattle.io/os=linux
[INFO]  Role requested: etcd
[INFO]  Role requested: controlplane
[INFO]  Role requested: worker
[INFO]  CA strict verification is set to false
[INFO]  Using default agent configuration directory /etc/rancher/agent
[INFO]  Using default agent var directory /var/lib/rancher/agent
[INFO]  Determined CA is not necessary to connect to Rancher
[INFO]  Successfully tested Rancher connection
[INFO]  Downloading rancher-system-agent binary from https://rancher.internal/assets/rancher-system-agent-amd64
[INFO]  Successfully downloaded the rancher-system-agent binary.
[INFO]  Downloading rancher-system-agent-uninstall.sh script from https://rancher.internal/assets/system-agent-uninstall.sh
[INFO]  Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO]  Generating Cattle ID
curl: (28) Operation timed out after 60002 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
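I guess the next thing for me to check is the Rancher side of that request, something like this (assuming the standard rancher deployment in the cattle-system namespace on the management cluster):
# Tail the Rancher server logs for the agent-connect attempts
kubectl -n cattle-system logs deploy/rancher --tail=200 | grep -i connect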
f
ohhhh gotcha; you're bootstrapping a cluster
I was mistaken then, I'm probably leading you down an incorrect path. For whatever reason I was thinking you were trying to add a new node to an existing k8s cluster... I'm not sure how bootstrapping works between Rancher & RKE2
b
I am actually trying to add a node to an existing cluster which has failed. I had to remove the etcd and control plane nodes, and all I have left now is the worker nodes. So right now I am trying to add a node to this existing cluster and I am facing the above-mentioned issue.
f
gotcha; I am really at a loss then. My experience with Rancher has largely been with importing hand-provisioned K3S clusters into Rancher. If my control-plane nodes all died and all I had were worker nodes left, I would think I'd probably be at "uninstall k3s and build a k3s cluster from scratch as if it were brand-new". A cluster without a control-plane or recoverable control-plane data isn't much of a cluster. That could change when you swap k3s with RKE2, and when you swap "I'm building the cluster using k3s directly and adding to Rancher" vs "I'm building the cluster through Rancher". Nevertheless, I've never been in this spot myself, so I'm not going to be much help. Apologies for the confusion
b
well, thank you for trying to help, much appreciated. That cluster has really important workloads for me, so I must revive it somehow. Since I was unable to register any node to the failed cluster with Rancher, I installed rke2 on a separate host and restored the etcd snapshot onto it. I can see all the cluster state intact on the new host, but that node cannot communicate with the other nodes, as I need to migrate/register the worker nodes there. Do you know if there is a way to migrate the worker nodes to that new host? In such a way that I don't lose the data of the PVs in Longhorn?
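For reference, the restore was roughly the documented RKE2 flow on the new host (the snapshot path below is a placeholder):
# Stop the server, reset the cluster from the snapshot, start it again
systemctl stop rke2-server
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>
systemctl start rke2-server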
f
#CC2UQM49Y is probably a good place to ask about the Longhorn bit. IMO, if the data is important, I'd be grabbing snapshots of the PV images before you do anything. In fact, if you can, make a backup or snapshot of the nodes, period. When you get it back up and online, I'd consider setting up Longhorn backups too.
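Before touching anything I'd at least inventory what Longhorn is holding, something like this (assumes Longhorn sits in its default longhorn-system namespace and you have a working API server, e.g. the host you restored the snapshot onto):
# List PVs and the Longhorn volume objects behind them
kubectl get pv
kubectl -n longhorn-system get volumes.longhorn.io
# The replica data itself normally lives under /var/lib/longhorn on each worker,
# so a plain file-level copy of that directory is another belt-and-braces option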
In k3s, the control node(s) or URLs are usually specified in a config file or directly in the systemd unit for k3s-agent that configures the worker nodes.
K3s config:
/etc/rancher/k3s/config.yaml
systemd unit:
systemctl cat k3s-agent
In the past, I had to recover k3s workers where 1 of the 3 control nodes failed and I had accidentally pinned the compute nodes to that first control node, so I went in and modified the "server" option to point to a new server
Example config file in /etc/rancher/k3s/config.yaml
$ cat /etc/rancher/k3s/config.yaml
node-name: compute-node-1
server: https://control-node-1:6443
token: K...::server:....
In that case, I updated the control-node-1 to be something else
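Concretely, that fix boiled down to something like this (hostnames are just examples from my setup):
# Point the agent at a surviving control node and restart it
sudo vi /etc/rancher/k3s/config.yaml      # change server: https://control-node-1:6443 to the new node
sudo systemctl restart k3s-agent
journalctl -u k3s-agent -f                # watch it reconnect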
b
I could find a similar file at the path
/etc/rancher/rke2/config.yaml.d/50-rancher.yaml
on the rke2 worker node as well.
{
  "node-label": [
    "<http://cattle.io/os=linux|cattle.io/os=linux>",
    "<http://rke.cattle.io/machine=b67b5f2c-9d28-4be1-8bfb-cb6e0768eb70|rke.cattle.io/machine=b67b5f2c-9d28-4be1-8bfb-cb6e0768eb70>"
  ],
  "private-registry": "/etc/rancher/rke2/registries.yaml",
  "server": "<https://10.0.20.165:9345>",
  "token": "<token>"
}
so if I just update the
"server": "https://10.0.20.165:9345"
address with the new control plane and restart the rke2-agent service, would it simply work?
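i.e. something along these lines, if that makes sense (the new address would be the host where I restored the snapshot):
# /etc/rancher/rke2/config.yaml.d/50-rancher.yaml, only the server line changed:
#   "server": "https://<new-server-ip>:9345",
systemctl restart rke2-agent
journalctl -u rke2-agent -f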
f
You know, I'm not sure, but assuming you have backups, that might be where I start. You might cross-post in #C01PHNP149L to be sure, since you're in the realm of recovering your RKE2 cluster and not necessarily troubleshooting Rancher itself
Like I said, I don't have much experience with rke2. It's similar to k3s but if it's important, it's probably worth double-checking
🙏 1