# rke2
abundant-hair-58573:
I was able to clear out the now-terminated etcd node by removing the finalizer; now the cluster is waiting for a new etcd node (and then to be restored from a snapshot). I'm guessing I'll run into the same thing if I try again. It's possible I ran into this bug, but I'm not entirely sure.
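For reference, clearing a finalizer on a stuck machine object can be done with a patch along these lines. This is a minimal sketch: the `machines.cluster.x-k8s.io` resource and the `fleet-default` namespace are the usual Rancher-provisioned defaults, and `<machine-name>` is a placeholder for the terminated etcd node's machine object.

```sh
# List the CAPI machine objects for the downstream cluster
kubectl get machines.cluster.x-k8s.io -n fleet-default

# Clear the finalizers on the stuck machine so it can be garbage-collected
kubectl patch machines.cluster.x-k8s.io <machine-name> -n fleet-default \
  --type=merge -p '{"metadata":{"finalizers":[]}}'
```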
I went ahead and tried again with a fresh etcd node; after running the register command I just get this:
```
[INFO]  Successfully downloaded the rancher-system-agent binary.
[INFO]  Downloading rancher-system-agent-uninstall.sh script from https://<rancher_url>/assets/system-agent-uninstall.sh
[INFO]  Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO]  Generating Cattle ID
[ERROR]  401 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
curl: (28) Operation timed out after 60000 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
```
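A quick way to rule out basic connectivity or TLS problems before chasing the 401 is to hit Rancher's unauthenticated endpoints from the new node. A sketch, using the same `<rancher_url>` placeholder as in the log above:

```sh
# Should return "pong" if the node can reach Rancher at all
curl -sk https://<rancher_url>/ping

# Should return the Rancher CA bundle without authentication
curl -sk https://<rancher_url>/cacerts
```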
plain-planet-80115:
If you run `kubectl get clusters.cluster.x-k8s.io -n fleet-default` on your upstream (local) cluster, you will get the list of downstream clusters. In the `clusters.cluster.x-k8s.io` object for the cluster you are trying to restore, you will find the parameter `spec.infrastructureRef.paused`, which might be set to `true`. This is likely caused by the first etcd restore putting the cluster into a paused state, which could be the reason the join process hung and was never completed properly. If you edit this CR object and set the value to `false`, the registration process should go through.
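For anyone hitting the same thing, checking and flipping that value could look roughly like this. A sketch: `<cluster-name>` is a placeholder for the downstream cluster's object name from the `kubectl get` above.

```sh
# Show any paused flags currently set on the cluster object
kubectl get clusters.cluster.x-k8s.io <cluster-name> -n fleet-default -o yaml | grep -i paused

# Edit the object and set the paused value described above to false
kubectl edit clusters.cluster.x-k8s.io <cluster-name> -n fleet-default
```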
abundant-hair-58573:
@plain-planet-80115 Thank you. I ended up blowing away the cluster and starting fresh; when I get some cycles I'll try again and see if that's part of the issue. I'm still concerned that the restore failed in the first place; it makes me nervous now that we're running RKE2 in prod.
plain-planet-80115:
@abundant-hair-58573 I can understand. We too have RKE2 clusters in production. I'd recommend relying on a third-party disaster recovery solution.