# general
m
Hi all, I'm currently running into an issue with an rke2 cluster (v1.28.10) created on Rancher (2.8.5) with the vSphere Node Driver, where nodes do not get provisioned for a NodePool. I tried to restore the cluster from an S3 etcd backup yesterday, which somehow bricked the whole cluster and left the cp/etcd nodes unavailable. I tried to follow the instructions here (https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/backup-restore-and-dis[…]tore-rancher-launched-kubernetes-clusters-from-backup) and deleted all cp/etcd nodes from the cluster. Deleting them via the Rancher UI did not work (it was stuck in the state `waiting for all etcd nodes to be deleted` for several hours), so I deleted the nodes manually in vSphere and removed the finalizers on the respective Machines and VmwarevsphereMachines. After this the cluster is stuck on `Waiting for at least one control plane, etcd, and worker node to be registered`, but no new nodes are getting provisioned. I tried creating a different NodePool for cp & etcd nodes, but no nodes get provisioned there either. The MachineDeployment and VmwarevsphereConfig got created, but no MachineSets, Machines, or VmwarevsphereMachines were created. I can still change NodePools of other clusters and those nodes are created fine, so the vSphere connection should be working. Does anybody have any ideas on how I can recover this cluster? Thanks!
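(For reference, a minimal sketch of the finalizer removal described above, assuming the cluster's objects live in the `fleet-default` namespace and that the vSphere machine CRD is `vmwarevspheremachines.rke-machine.cattle.io`; substitute your real object names.)

```sh
# List the stuck CAPI machines for the cluster (namespace assumed to be fleet-default)
kubectl -n fleet-default get machines.cluster.x-k8s.io

# Clear the finalizers so the orphaned objects can be garbage-collected
kubectl -n fleet-default patch machines.cluster.x-k8s.io <machine-name> \
  --type merge -p '{"metadata":{"finalizers":null}}'
kubectl -n fleet-default patch vmwarevspheremachines.rke-machine.cattle.io <machine-name> \
  --type merge -p '{"metadata":{"finalizers":null}}'
```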
Found the error thanks to this issue: https://github.com/rancher/rancher/issues/43735. `spec.paused` was set to `true` in the CAPI Cluster.
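(A sketch of the fix, again assuming the CAPI Cluster object sits in `fleet-default`. While `spec.paused` is `true`, the Cluster API controllers ignore the cluster, which is why no MachineSets or Machines were being created.)

```sh
# Check whether the CAPI Cluster is paused (namespace assumed to be fleet-default)
kubectl -n fleet-default get clusters.cluster.x-k8s.io <cluster-name> \
  -o jsonpath='{.spec.paused}'

# Unpause so the Cluster API controllers reconcile the cluster again
kubectl -n fleet-default patch clusters.cluster.x-k8s.io <cluster-name> \
  --type merge -p '{"spec":{"paused":false}}'
```

Once unpaused, the controllers should pick the cluster back up and start creating MachineSets and Machines for the NodePool.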
f
Another note on using the vSphere provisioner: in your cloud-init section, add a wait for open-vm-tools. The provisioner sometimes hits a race condition where the node joins Cluster API but its machine secret gets deleted, causing a mismatch between the cluster-management state and the Cluster API state. Waiting lets VMware Tools report back to vSphere and prevents nodes from looping as they are replaced; a sketch is below.
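(A minimal sketch of such a wait, assuming open-vm-tools is preinstalled in the VM template and the service is named `vmtoolsd`; names vary by distro.)

```yaml
#cloud-config
runcmd:
  # Make sure open-vm-tools is running so the VM can report back to vSphere
  - systemctl enable --now vmtoolsd.service
  # Block until VMware Tools responds, so the node isn't replaced mid-join
  - timeout 120 bash -c 'until vmware-toolbox-cmd stat sessionid >/dev/null 2>&1; do sleep 2; done'
```

The `timeout` guard keeps a broken Tools install from blocking cloud-init indefinitely.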