Yeah, it was a sad Friday night. What made it worse is that I didn't mean to lose the PVs. I had backups, but not as recent as they should have been. In previous cluster failures I could always mount the PVs on a VM to recover the latest data during a rebuild, even if I had lost the guest cluster. But this time I managed to lose all the PVs.
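For context, that recovery trick was basically: attach the orphaned volume somewhere readable and copy the data off. In Harvester I'd attach it to a throwaway VM from the UI; the pod-based equivalent looks roughly like this (all names here are examples, and it assumes the PVC still exists and isn't attached elsewhere):

```bash
# Throwaway rescue pod mounting the orphaned PVC so the data can be copied out.
# claimName/namespace are examples -- use whatever `kubectl get pvc -A` shows.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pv-rescue
  namespace: default
spec:
  containers:
  - name: rescue
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /mnt/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-app-data
EOF

# Then copy the files off:
kubectl cp default/pv-rescue:/mnt/data ./rescued-data
```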
To be clear/fair, I run Harvester at home to get deeper into the weeds of k8s internals, so I knowingly start experimenting and trying to solve things myself instead of taking a more cautious enterprise approach.
I'm not sure exactly what triggered the loss of the PVs, though -- probably some combination of the random things I tried.
Roughly what happened that night..
Making a sizing change to the nodes post-upgrade is when I noticed things hung, along with the above error.
I tried 'Restore etcd snapshot' -- this simply hung, leaving a node also stuck in provisioning with a status of 'waiting for rke2 to stop'.
Tried deleting a working etcd node to see if it would at least recreate. This hung too.
Now down to 2 etcd nodes.
Started trying to see if what's in install.sh could explain the 'machine not found' message, and whether I could recreate the Rancher connection info somehow. Not really.
Noticed the k3s local cluster did let me hit 'configure', with options for k3s versions, so I tried upgrading the k3s version. This just hung the local cluster in 'upgrading', spamming jobs on the local cluster.
Tried to find docs on whether/how to upgrade Rancher within the vcluster, but didn't find anything. I was thinking maybe this was some versioning issue where my guest cluster or vcluster was too old and therefore not compatible.
Noticed the manifest used to enable the vcluster had newer versions defined when I curled it today. #yolo -- curled the newer version and applied it to the cluster. This resulted in the Rancher vcluster being newer.
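Roughly, the #yolo step was just re-pulling the addon manifest and applying it. The URL below is from memory of Harvester's experimental-addons repo, so treat it as illustrative and use whatever you originally enabled the addon from:

```bash
# Pull the current rancher-vcluster addon manifest (URL illustrative, from memory).
curl -sfL \
  https://raw.githubusercontent.com/harvester/experimental-addons/main/rancher-vcluster/rancher-vcluster.yaml \
  -o rancher-vcluster.yaml

# Worth eyeballing which chart/app versions it pins before applying:
grep -iE 'version|tag' rancher-vcluster.yaml

kubectl apply -f rancher-vcluster.yaml
```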
A newer version of Kubernetes was then shown for the guest cluster, so I tried to upgrade -- but the previous two hung operations were still hung.
.. at some point the cluster did show as accessible again (Explore worked), but it only showed a single etcd node as a member.
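If you'd rather sanity-check membership from the CLI than trust the UI, something like this against whichever etcd pod is still healthy should do it (cert paths are the stock RKE2 ones, from memory -- adjust if your layout differs):

```bash
# Grab one of the rke2 etcd static pods and ask etcd itself who the members are.
ETCD_POD=$(kubectl -n kube-system get pod -l component=etcd \
  -o jsonpath='{.items[0].metadata.name}')

kubectl -n kube-system exec "$ETCD_POD" -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list -w table
```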
Somewhere around now, I decided to start provisioning a new cluster and to start mounting the PVs to a VM to grab the latest backups of key services. But to my horror, the PVs' age was only 45 minutes. To even more horror, they were 0 bytes.
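The CLI equivalent of what the UI was showing me is roughly this -- PV creation times plus the actual on-disk size of the backing Longhorn volumes (field names from memory of Longhorn's Volume CRD):

```bash
# PV creation times, oldest first -- this is where the "45 minutes old" sank in.
kubectl get pv --sort-by=.metadata.creationTimestamp

# Actual data size vs. provisioned capacity of the Longhorn volumes.
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,SIZE:.spec.size,ACTUAL:.status.actualSize
```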
During most of this, I could always create a 'new' guest cluster and it would provision fine. So it does seem my existing guest cluster just got lost along the way somehow.