Yeah, it was a sad Friday night. What made it worse is that I didn't mean to lose the PVs. I had backups, but not as recent as they should have been. In previous cluster failures I could always mount the PVs on a VM to recover the latest data during a rebuild, even if I had lost the guest cluster. But this time I managed to lose all the PVs.
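For context, that recovery trick was basically: attach the orphaned volume somewhere readable and copy the data off. In Harvester I'd attach it to a throwaway VM from the UI; the pod-based equivalent looks roughly like this (all names here are examples, and it assumes the PVC still exists and isn't attached elsewhere):

```bash
# Throwaway rescue pod mounting the orphaned PVC so the data can be copied out.
# claimName/namespace are examples -- use whatever `kubectl get pvc -A` shows.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pv-rescue
  namespace: default
spec:
  containers:
  - name: rescue
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /mnt/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-app-data
EOF

# Then copy the files off:
kubectl cp default/pv-rescue:/mnt/data ./rescued-data
```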
To be clear/fair, I run Harvester at home to get deeper into the weeds of k8s internals, so I knowingly start experimenting and trying to solve things myself instead of taking a more cautious enterprise approach.
I'm not sure exactly what triggered the loss of the PVs, though -- probably some combination of the random things I tried.
Roughly what happened that night..
Making a sizing change to the nodes post-upgrade is when I noticed things hung, along with the above error.
I tried 'Restore etcd snapshot' -- this simply hung, leaving a node also stuck in provisioning with a status of 'waiting for rke2 to stop'.
Tried deleting a working etcd node to see if it would at least recreate. This hung too.
Now down to 2 etcd nodes.
Started trying to see if what's in install.sh could explain the 'machine not found' message, and whether I could recreate the Rancher connection info somehow. Not really.
Noticed the k3s local cluster did let me hit 'configure', with options for k3s versions, so I tried upgrading the k3s version. This just hung the local cluster in 'upgrading', spamming jobs on the local cluster.
Tried to find docs on whether/how to upgrade Rancher within the vcluster, but didn't find anything. I was thinking maybe this was some versioning issue where my guest cluster or vcluster was too old and therefore not compatible.
Noticed the manifest used to enable the vcluster had newer versions defined when I curled it today. #yolo -- curled the newer version and applied it to the cluster. This resulted in the Rancher vcluster being newer.
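Roughly, the #yolo step was just re-pulling the addon manifest and applying it. The URL below is from memory of Harvester's experimental-addons repo, so treat it as illustrative and use whatever you originally enabled the addon from:

```bash
# Pull the current rancher-vcluster addon manifest (URL illustrative, from memory).
curl -sfL \
  https://raw.githubusercontent.com/harvester/experimental-addons/main/rancher-vcluster/rancher-vcluster.yaml \
  -o rancher-vcluster.yaml

# Worth eyeballing which chart/app versions it pins before applying:
grep -iE 'version|tag' rancher-vcluster.yaml

kubectl apply -f rancher-vcluster.yaml
```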
A newer version of Kubernetes was then shown for the guest cluster, so I tried to upgrade -- but the previous two hung operations were still hung.
.. at some point the cluster did show as accessible again (Explore worked), but it only showed a single etcd node as a member.
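If you'd rather sanity-check membership from the CLI than trust the UI, something like this against whichever etcd pod is still healthy should do it (cert paths are the stock RKE2 ones, from memory -- adjust if your layout differs):

```bash
# Grab one of the rke2 etcd static pods and ask etcd itself who the members are.
ETCD_POD=$(kubectl -n kube-system get pod -l component=etcd \
  -o jsonpath='{.items[0].metadata.name}')

kubectl -n kube-system exec "$ETCD_POD" -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list -w table
```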
Somewhere around now, I decided to start provisioning a new cluster and to start mounting the PVs to a VM to grab the latest backups of key services. But to my horror, the PVs' age was only 45 minutes. To even more horror, they were 0 bytes.
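The CLI equivalent of what the UI was showing me is roughly this -- PV creation times plus the actual on-disk size of the backing Longhorn volumes (field names from memory of Longhorn's Volume CRD):

```bash
# PV creation times, oldest first -- this is where the "45 minutes old" sank in.
kubectl get pv --sort-by=.metadata.creationTimestamp

# Actual data size vs. provisioned capacity of the Longhorn volumes.
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,SIZE:.spec.size,ACTUAL:.status.actualSize
```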
During most of this, I could always create a 'new' guest cluster and it would provision fine. So it does seem my existing guest cluster just got lost along the way somehow.