# harvester
t
dammit, first upgrade that has gotten stuck on me. 1.5.1 --> 1.6.0. Anyone seen an upgrade stuck at a node's “post-drain”?
b
I have before. There's a job pod that runs and does the system update. If it hasn't run, I've had a reboot unstick it, but I've also had it get messed up.
h
I've got a similar problem. I'm stuck at the post-drain of the second node. It's trying to start a VM that hosts the upgrade repo, but the VM is missing? Reboot didn't fix it
t
same…
h
I saw the VM was stuck in being unable to start because the volume was missing. I checked and yes, the volume wasn't there. I removed the VMI, and the VM went away. I've been trying to figure out how to re-create it, but no luck so far.
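For anyone else poking at the same thing, this is roughly where I was looking (the namespace is my assumption of where the upgrade-repo VM lives, and names will differ on your cluster):
```bash
# List the VMs/VMIs and PVCs the upgrade created (harvester-system is an assumption)
kubectl -n harvester-system get vm,vmi
kubectl -n harvester-system get pvc

# Check whether the backing Longhorn volume actually exists
kubectl -n longhorn-system get volumes.longhorn.io
```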
I'm almost down to deleting the node to see if it moves on, but that seems like a bad idea in the middle of an upgrade? 🤷
Oh, overnight the job reached its backoff limit, so it isn't even trying anymore
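The job side looked roughly like this (namespace and job name are my guesses, adjust to your cluster):
```bash
# Find the node-upgrade job and confirm it hit its backoff limit
kubectl -n harvester-system get jobs
kubectl -n harvester-system describe job <job-name>   # look for BackoffLimitExceeded in the conditions
kubectl -n harvester-system logs job/<job-name>
```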
b
Did either of you get your upgrades sorted?
t
nope... it's a POC, it will get wiped soon. And they're testing the PX CSI and a Pure array.
b
Bummer
h
I got mine mostly recovered back to 1.5.1, but had a problem where etcd would start growing quickly, from ~80M to 2G in 10 minutes. I didn't feel like fighting with it anymore and just rebuilt
Once I got back to 1.5 I was able to get the "upgrade" button again, but that's when etcd started being weird
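For reference, this is roughly how I was watching the db size, assuming RKE2's etcd static pod and the usual cert paths (both are assumptions and may differ on your setup):
```bash
# Check etcd DB size from a control-plane node (pod name and cert paths are assumptions)
kubectl -n kube-system exec -it etcd-<node-name> -- etcdctl \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint status --write-out=table
```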
b
yeesh
I just upgraded our dev cluster to 1.5.1 but I haven't seen the button trigger yet.
h
I manually applied the upgrade.yaml. I should have waited 🙂
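For context, the upgrade.yaml was basically just the Upgrade CR, something along these lines (name and version are examples, and the group/namespace are from memory, so double-check against the docs):
```yaml
# Sketch of what I applied, not an exact copy
apiVersion: harvesterhci.io/v1beta1
kind: Upgrade
metadata:
  name: hvst-upgrade-example
  namespace: harvester-system
spec:
  version: v1.6.0
```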
b
The amount of times I've been burned because I should have waited.... lol
h
Me too. I've been doing this for long enough I should know better
I found a bug report (I've since closed the tab) about the volume only having one replica and getting stuck. I'm wondering if it's the same thing, and whether it would have fixed itself if I had uncordoned the node.
b
Kinda sounds like it.
Did you change your default replica count on your cluster before?
h
Yes, it was on two.
b
Oh hm. Maybe that was part of it?
h
I had two nodes + a witness, then upgraded to 3 new nodes, but never changed it
b
You probably could have found that volume in the Longhorn UI and bumped it up that way.
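Or from the CLI, something like this (volume name is a placeholder, and the field/setting names are from memory, so double-check):
```bash
# Bump the replica count on one Longhorn volume
kubectl -n longhorn-system patch volumes.longhorn.io <volume-name> \
  --type merge -p '{"spec":{"numberOfReplicas":3}}'

# The cluster-wide default is a Longhorn setting
kubectl -n longhorn-system get settings.longhorn.io default-replica-count
```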
h
Another lesson I should have learned in all my years in ops: Don't try to debug this stuff late at night when it can wait until tomorrow.
I thought that deleting the VMI would allow the VM to be rescheduled on another node, but when I deleted the VMI the VM went away too. I don't know how that happened
When the VM went away, the PVC was deleted and I couldn't figure out how to restart the process to re-create the upgrade PVC. I tried deleting the status from the upgrade resource, and that reset the status in the GUI, but didn't seem to actually restart the process.
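The furthest I got was poking at the CR directly, roughly like this (resource and namespace names are from memory):
```bash
# Inspect the stuck Upgrade resource and its status
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n harvester-system get upgrades.harvesterhci.io <upgrade-name> -o yaml
```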
b
Hm
I know there's a way to trigger a restart (vs just deleting the VMI)
but I can never remember what it is, so I normally trigger it from the UI and then kill the pod if it takes too long.
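I think it's virtctl, something like this (assuming you have the virtctl CLI handy):
```bash
# Restart a VM properly instead of deleting the VMI by hand
virtctl restart <vm-name> -n <namespace>

# Or stop/start if it's really wedged
virtctl stop <vm-name> -n <namespace>
virtctl start <vm-name> -n <namespace>
```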
h
It was stuck in starting, so the VMI was in an invalid state
b
It sounds like they're tired of these tickets though, and they're switching to the backing image replacement for 1.7.0
h
Sounds good to me! This kind of issue is one of the things holding Harvester back from being a viable alternative to VMware. It's improving quickly, but not there yet 🙂 I'm excited about Kube-OVN, but I think I'll wait until 1.7 to play with it