# harvester
t
dammit, first upgrade that has gotten stuck on me. 1.5.1 --> 1.6.0. Anyone seen an upgrade stuck at a node's “post-drain”?
b
I have before. There's a job pod that runs and does the system update. If it hasn't run, I've had a reboot unstick it, but I've also had it get messed up.
h
I've got a similar problem. I'm stuck at the post-drain of the second node. It's trying to start a VM that hosts the upgrade repo, but the VM is missing? Reboot didn't fix it
t
same…
h
I saw the VM was stuck in being unable to start because the volume was missing. I checked and yes, the volume wasn't there. I removed the VMI, and the VM went away. I've been trying to figure out how to re-create it, but no luck so far.
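For anyone else poking at the same thing, this is roughly where I was looking (the namespace is my assumption of where the upgrade-repo VM lives, and names will differ on your cluster):
```bash
# List the VMs/VMIs and PVCs the upgrade created (harvester-system is an assumption)
kubectl -n harvester-system get vm,vmi
kubectl -n harvester-system get pvc

# Check whether the backing Longhorn volume actually exists
kubectl -n longhorn-system get volumes.longhorn.io
```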
I'm almost down to deleting the node to see if it moves on, but that seems like a bad idea in the middle of an upgrade? 🤷
Oh, overnight the job reached its backoff limit, so it isn't even trying anymore
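The job side looked roughly like this (namespace and job name are my guesses, adjust to your cluster):
```bash
# Find the node-upgrade job and confirm it hit its backoff limit
kubectl -n harvester-system get jobs
kubectl -n harvester-system describe job <job-name>   # look for BackoffLimitExceeded in the conditions
kubectl -n harvester-system logs job/<job-name>
```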
b
Did either of you get your upgrades sorted?
t
nope... it's a POC, it will get wiped soon. And they're testing the PX CSI and a Pure array.
b
Bummer
h
I got mine mostly recovered back to 1.5.1, but had a problem where etcd would start growing quickly, from ~80M to 2G in 10 minutes. I didn't feel like fighting with it anymore and just rebuilt
Once I got back to 1.5 I was able to get the "upgrade" button again, but that's when etcd started being weird
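For reference, this is roughly how I was watching the db size, assuming RKE2's etcd static pod and the usual cert paths (both are assumptions and may differ on your setup):
```bash
# Check etcd DB size from a control-plane node (pod name and cert paths are assumptions)
kubectl -n kube-system exec -it etcd-<node-name> -- etcdctl \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint status --write-out=table
```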
b
yeesh
I just upgraded our dev cluster to 1.5.1 but I haven't seen the button trigger yet.
h
I manually applied the upgrade.yaml. I should have waited 🙂
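For context, the upgrade.yaml was basically just the Upgrade CR, something along these lines (name and version are examples, and the group/namespace are from memory, so double-check against the docs):
```yaml
# Sketch of what I applied, not an exact copy
apiVersion: harvesterhci.io/v1beta1
kind: Upgrade
metadata:
  name: hvst-upgrade-example
  namespace: harvester-system
spec:
  version: v1.6.0
```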
b
The amount of times I've been burned because I should have waited.... lol
h
Me too. I've been doing this for long enough I should know better
I found a bug report (I've since closed the tab) about the volume only having one replica and getting stuck. I'm wondering if it's the same thing, and whether it would have fixed itself if I had uncordoned the node.
b
Kinda sounds like it.
Did you change your default replica count on your cluster before?
h
Yes, it was on two.
b
Oh hm. Maybe that was part of it?
h
I had two nodes + a witness, then upgraded to 3 new nodes, but never changed it
b
You probably could have found that volume in the Longhorn UI and bumped it up that way.
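Or from the CLI, something like this (volume name is a placeholder, and the field/setting names are from memory, so double-check):
```bash
# Bump the replica count on one Longhorn volume
kubectl -n longhorn-system patch volumes.longhorn.io <volume-name> \
  --type merge -p '{"spec":{"numberOfReplicas":3}}'

# The cluster-wide default is a Longhorn setting
kubectl -n longhorn-system get settings.longhorn.io default-replica-count
```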
h
Another lesson I should have learned in all my years in ops: Don't try to debug this stuff late at night when it can wait until tomorrow.
I thought that deleting the VMI would allow the VM to be rescheduled on another node, but when I deleted the VMI the VM went away too. I don't know how that happened
When the VM went away, the PVC was deleted and I couldn't figure out how to restart the process to re-create the upgrade PVC. I tried deleting the status from the upgrade resource, and that reset the status in the GUI, but didn't seem to actually restart the process.
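The furthest I got was poking at the CR directly, roughly like this (resource and namespace names are from memory):
```bash
# Inspect the stuck Upgrade resource and its status
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n harvester-system get upgrades.harvesterhci.io <upgrade-name> -o yaml
```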
b
Hm
I know there's a way to trigger a restart (vs just deleting the VMI)
but I can never remember what it is, so I normally trigger it from the UI and then kill the pod if it takes too long.
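I think it's virtctl, something like this (assuming you have the virtctl CLI handy):
```bash
# Restart a VM properly instead of deleting the VMI by hand
virtctl restart <vm-name> -n <namespace>

# Or stop/start if it's really wedged
virtctl stop <vm-name> -n <namespace>
virtctl start <vm-name> -n <namespace>
```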
h
It was stuck in starting, so the VMI was in an invalid state
b
It sounds like they're tired of these tickets though, and they're switching to the backing image replacement for 1.7.0
h
Sounds good to me! This kind of issue is one of the things holding Harvester back from being a viable alternative to VMware. It's improving quickly, but not there yet 🙂 I'm excited about Kube-OVN, but I think I'll wait until 1.7 to play with it