# rke2
abundant-hair-58573:
I was able to clear out the now-terminated etcd node by removing the finalizer; now the cluster is waiting for a new etcd node (and then to be restored from a snapshot). I'm guessing I'll run into the same thing if I try again. It's possible I ran into this bug, but I'm not entirely sure.
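For reference, clearing a finalizer on a stuck machine object can be done with a patch along these lines. This is a minimal sketch: the `machines.cluster.x-k8s.io` resource and the `fleet-default` namespace are the usual Rancher-provisioned defaults, and `<machine-name>` is a placeholder for the terminated etcd node's machine object.

```sh
# List the CAPI machine objects for the downstream cluster
kubectl get machines.cluster.x-k8s.io -n fleet-default

# Clear the finalizers on the stuck machine so it can be garbage-collected
kubectl patch machines.cluster.x-k8s.io <machine-name> -n fleet-default \
  --type=merge -p '{"metadata":{"finalizers":[]}}'
```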
I went ahead and tried again with a fresh etcd node; after running the register command I just get this:
```
[INFO]  Successfully downloaded the rancher-system-agent binary.
[INFO]  Downloading rancher-system-agent-uninstall.sh script from https://<rancher_url>/assets/system-agent-uninstall.sh
[INFO]  Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO]  Generating Cattle ID
[ERROR]  401 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
curl: (28) Operation timed out after 60000 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
```
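A quick way to rule out basic connectivity or TLS problems before chasing the 401 is to hit Rancher's unauthenticated endpoints from the new node. A sketch, using the same `<rancher_url>` placeholder as in the log above:

```sh
# Should return "pong" if the node can reach Rancher at all
curl -sk https://<rancher_url>/ping

# Should return the Rancher CA bundle without authentication
curl -sk https://<rancher_url>/cacerts
```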
plain-planet-80115:
If you run `kubectl get clusters.cluster.x-k8s.io -n fleet-default` on your upstream (local) cluster, you will get the list of downstream clusters. In the `clusters.cluster.x-k8s.io` object for the cluster you are trying to restore, you will find the parameter `spec.infrastructureRef.paused`, which might be set to `true`. This is likely caused by the first etcd restore putting the cluster into a paused state, which could be the reason the join process hung and was never completed properly. If you edit this CR object and set the value to `false`, the registration process should go through.
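For anyone hitting the same thing, checking and flipping that value could look roughly like this. A sketch: `<cluster-name>` is a placeholder for the downstream cluster's object name from the `kubectl get` above.

```sh
# Show any paused flags currently set on the cluster object
kubectl get clusters.cluster.x-k8s.io <cluster-name> -n fleet-default -o yaml | grep -i paused

# Edit the object and set the paused value described above to false
kubectl edit clusters.cluster.x-k8s.io <cluster-name> -n fleet-default
```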
abundant-hair-58573:
@plain-planet-80115 Thank you. I ended up blowing away the cluster and starting fresh; when I get some cycles I'll try again and see if that's part of the issue. I'm still concerned that the restore failed in the first place; it makes me nervous now that we're running RKE2 in prod.
plain-planet-80115:
@abundant-hair-58573 I can understand. We too have RKE2 clusters in production. I'd recommend relying on a third-party disaster recovery solution.