# rke2
b
Hello, I have been trying to revive my RKE2 cluster, which is provisioned using Rancher. Here is the brief story/history:
• Rancher lost access to the RKE2 cluster because the `kube-scheduler` and `kube-controller-manager` certificates were not rotated automatically by Rancher.
• The etcd nodes were failing to start because they couldn't communicate with the control planes.
• So I assumed the cluster had failed and everything was haywire, and I decided to perform DR because I had etcd snapshots from the day before.
• I first followed this guide, which asks you to remove the control planes from Rancher:
  ◦ https://support.tools/post/rke2-with-rancher-disaster-recovery/
• The above did not work, so I then followed this guide:
  ◦ https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/backup-restore-and-dis[…]tore-rancher-launched-kubernetes-clusters-from-backup

Current situation of the cluster:
1. In Rancher, the cluster is left with only the 3 worker nodes; all the control plane and etcd nodes have been removed from Rancher.
2. Trying to add a node to the cluster gives the following error when I run the cluster registration command copied from Rancher:
```
curl -fL https://rancher.internal/system-agent-install.sh | sudo sh -s - --server https://rancher.internal --label 'cattle.io/os=linux' --token <token> --ca-checksum <checksum> --etcd --controlplane --worker

[INFO]  Label: cattle.io/os=linux
[INFO]  Role requested: etcd
[INFO]  Role requested: controlplane
[INFO]  Role requested: worker
[INFO]  CA strict verification is set to false
[INFO]  Using default agent configuration directory /etc/rancher/agent
[INFO]  Using default agent var directory /var/lib/rancher/agent
[INFO]  Determined CA is not necessary to connect to Rancher
[INFO]  Successfully tested Rancher connection
[INFO]  Downloading rancher-system-agent binary from https://rancher.internal/assets/rancher-system-agent-amd64
[INFO]  Successfully downloaded the rancher-system-agent binary.
[INFO]  Downloading rancher-system-agent-uninstall.sh script from https://rancher.internal/assets/system-agent-uninstall.sh
[INFO]  Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO]  Generating Cattle ID
curl: (28) Operation timed out after 60002 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
```
3. I did a little bit of troubleshooting and found out that the `system-agent-install.sh` script, which is run to register a node to the cluster, gets stuck at this command:

```
curl --connect-timeout 60 --max-time 60 --write-out '%{http_code}\n' -sS -H 'Authorization: Bearer <token>' -H 'X-Cattle-Id: f8bcebdca8c1dcce980ee7d67b583b5b3db64419bc3a0e130f8a1369a8a395a' -H 'X-Cattle-Role-Etcd: true' -H 'X-Cattle-Role-Control-Plane: true' -H 'X-Cattle-Role-Worker: true' -H 'X-Cattle-Node-Name: <eradicated>' -H 'X-Cattle-Address: ' -H 'X-Cattle-Internal-Address: <eradicated>' -H 'X-Cattle-Labels: cattle.io/os=linux' -H 'X-Cattle-Taints: ' https://rancher.internal/v3/connect/agent -o /var/lib/rancher/agent/rancher2_connection_info.json
```
Is there a bug in Rancher? Is it not possible to register any node after you remove all the control plane/etcd nodes?
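In case it helps anyone reproduce this, here is a rough sketch of how the Rancher side can be inspected while the registration curl hangs (assuming kubectl access to the Rancher management ("local") cluster; the resource names below are Rancher's standard provisioning CRDs, so verify them against your Rancher version):

```
# Does Rancher still have the provisioning cluster object?
kubectl get clusters.provisioning.cattle.io -A

# Are any machine objects left for this cluster?
kubectl get machines.cluster.x-k8s.io -n fleet-default

# Watch Rancher's logs while the registration curl is hanging:
kubectl logs -n cattle-system deploy/rancher --tail=100 -f
```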
m
Just to confirm, you deleted the control plane nodes from your cluster?
b
Yes, I did. I followed the guide mentioned above for the DR.
b
Maybe, in your case, manual certificate rotation would have helped. See https://docs.rke2.io/security/certificates . You should never delete all the control plane nodes. The cluster basically is the control plane.
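Per those docs, the rotation is roughly this (a minimal sketch; the `rke2 certificate rotate` subcommand and its `--service` flag can vary by RKE2 version, so check `rke2 certificate rotate --help` first):

```
# On each server node, one at a time (verify flags for your RKE2 version):
systemctl stop rke2-server

# Rotate all certificates, or only the expired services:
rke2 certificate rotate
# rke2 certificate rotate --service kube-scheduler
# rke2 certificate rotate --service kube-controller-manager

systemctl start rke2-server
```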
b
Thanks, but certificate rotation is not the issue anymore. The issue now is that nodes are not able to join the cluster.
b
Well, I was trying to tell you that, technically, you don't have a cluster anymore after deleting all the control plane nodes. And you also didn't mention any backup kept outside of the control plane nodes. Delete the workers and start over!
b
Oh yes, you are right that the cluster doesn't exist anymore. What I am curious about is whether it's possible to force these worker nodes to be part of the same Rancher cluster instance, even though I have to start over.
b
The cluster instance is just a bunch of YAML in Rancher. See it as a blueprint of the actual cluster. The configuration, workloads, etc. are not persisted in Rancher but on the control plane nodes in etcd. Rancher is the management instance. A normal cluster (with control plane nodes) continues to work even if you switch off Rancher.
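For example (assuming the default RKE2 paths), you can talk to such a cluster directly from a server node with Rancher switched off entirely:

```
# On any control plane node -- RKE2 writes a local admin kubeconfig here:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes
```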
b
Yes, that's what I did. I restored the etcd snapshot on one of the nodes (with all the roles) and re-attached all the nodes, even the worker nodes, without losing any data. But it's independent of Rancher. What I actually need is to attach the new cluster to the same old Rancher cluster instance I had before, because it is managed by code. Is there any way to do it?
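For reference, the restore itself was roughly the documented `--cluster-reset` flow (a sketch; the snapshot name is a placeholder):

```
# On the server node that should hold the restored data:
systemctl stop rke2-server
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>
systemctl start rke2-server

# On every other server node, wipe the stale etcd data before rejoining:
systemctl stop rke2-server
rm -rf /var/lib/rancher/rke2/server/db
systemctl start rke2-server
```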
b
If it was really provisioned by Rancher, I don't know of any way. You can detach it, remove it from Rancher, and import it again, but then it's just an imported cluster. Check the docs for the differences.
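The import itself is just applying an agent manifest that Rancher generates when you register the imported cluster (a sketch; the exact URL and token come from the Rancher UI, the values below are placeholders):

```
# Run with the recovered cluster's kubeconfig; the URL/token are placeholders
# generated by Rancher for an imported ("generic") cluster:
curl -sfL https://rancher.internal/v3/import/<token>.yaml | kubectl apply -f -
```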