Hi! Is this the right place to get Harvester suppo...
# harvester
c
Hi! Is this the right place to get Harvester support? We have a Harvester cluster that is stuck in upgrade from 1.3.2 to 1.4.0 at "pre-drained" of first node.
a
There are some known issues documented in the upgrade guides. One of them matches what you are observing: see "Upgrade from v1.3.2 to v1.4.0" in the Harvester documentation.
c
Thanks, I saw that, but I don't think it matches our case exactly. We shut down all VMs prior to upgrading, to avoid issues with live migration. We also did the recommended update of Longhorn as detailed in the preparation steps. Also, I think this issue shows up as being stuck in "pre-draining", while we are in "pre-drained". I read about possible issues with volume synchronisation, but all volumes are healthy in the Longhorn UI. I wanted to stop the upgrade and try it again, but it won't let me delete the upgrade resource because a node upgrade is in progress (but stuck). Is there any other way of stopping/retrying this manually?
a
Yeah, that makes sense. What has helped me in the past when an upgrade got stuck (short of simply rebuilding the cluster) was searching the issues on the Harvester GitHub. Usually other folks had already run into the same problem, and the SUSE team or the community had provided some helpful advice. I recall one case where simply disabling the Longhorn v2 data engine got me past a stuck upgrade. In another, one node had a different processor than the others, so the upgrade VM was unable to move to other nodes. In both cases the GitHub issues were very helpful. I know pointing you to the GitHub issues page doesn't directly resolve your problem, but it might get you headed in the right direction.
c
Thanks, I will do a deep dive there, then.
I did find the issue: our certificates for kube-controller-manager and kube-scheduler had expired. Once I rotated them, the upgrade resumed and completed without further issues. Thanks again for pointing me to the GitHub page!
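For anyone who lands here with the same symptom, here is a minimal sketch of how a certificate's remaining lifetime could be checked. It assumes the `openssl` CLI is available on the node. The helper names (`parse_enddate`, `days_left`, `cert_days_left`) are made up for illustration, and the certificate paths are deliberately taken from the command line because they depend on the installation.

```python
import ssl
import subprocess
import sys
import time


def parse_enddate(openssl_output):
    """Convert a line like 'notAfter=Jan  5 09:34:43 2018 GMT' to a Unix timestamp."""
    _, _, value = openssl_output.strip().partition("=")
    return ssl.cert_time_to_seconds(value)


def days_left(not_after_ts, now=None):
    """Days until expiry; a negative result means the certificate has expired."""
    if now is None:
        now = time.time()
    return (not_after_ts - now) / 86400


def cert_days_left(path):
    """Ask openssl for the notAfter date of a PEM certificate file (path is an assumption)."""
    result = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return days_left(parse_enddate(result.stdout))


if __name__ == "__main__":
    # Usage: python check_certs.py <cert1.crt> <cert2.crt> ...
    for cert_path in sys.argv[1:]:
        remaining = cert_days_left(cert_path)
        status = "EXPIRED" if remaining < 0 else "ok"
        print(f"{cert_path}: {remaining:.1f} days left ({status})")
```

Run it against the kube-controller-manager and kube-scheduler certificate files on each control-plane node; anything reported as expired is a candidate for rotation before retrying the upgrade.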
b
There is a setting in more recent Harvester versions to rotate certificates and restart RKE2 services automatically. Sadly it's off by default.