# harvester
b
We've been throwing nodes into maintenance mode to get some firmware updated, and I've noticed that a LOT of the VMs haven't been able to migrate. It seems like on some of the VMs, Longhorn has more than one engine running for the PVC volume. Deleting the engine on the draining node seems to fix it (but the VM gets paused). I can't help but wonder if there's a better way to fix it?
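Roughly what I've been doing to spot and clear the duplicate, in case it helps (the volume and engine names are placeholders, and I'm going from memory on the Longhorn labels):

```bash
# List the engine CRs for the volume that refuses to migrate
# (Longhorn labels engines/replicas with longhornvolume=<volume-name>)
kubectl -n longhorn-system get engines.longhorn.io \
  -l longhornvolume=pvc-0123abcd -o wide

# If two engines show up, delete the one scheduled on the node being drained
# (this is the step that pauses the VM, so only as a last resort)
kubectl -n longhorn-system delete engines.longhorn.io <engine-on-draining-node>
```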
m
Do you have any VMs that can be migrated at all? Or is something tying them to their nodes, like a CSI driver that's stuck to certain nodes, or a PCI device, for example?
b
Nope.
Several of the VMs were downstream k8s nodes. No storage other than Longhorn.
Around the same time this happened, the cluster did a full-blown reset and we lost a DB.
Luckily the downstream cluster had Ceph, so we still had the files.
m
b
I've seen it hit that before, but these namespaces and projects didn't have any quotas set.
Literally the Longhorn replicas showed 6 instead of 3 in the GUI, and it kept throwing an error stating that more than one engine was running.
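Something like this shows the same mismatch from the CLI, if anyone wants to double-check (volume name is a placeholder):

```bash
# Count the replica CRs Longhorn currently has for the volume
kubectl -n longhorn-system get replicas.longhorn.io \
  -l longhornvolume=pvc-0123abcd

# Compare against the desired replica count on the volume itself
kubectl -n longhorn-system get volumes.longhorn.io pvc-0123abcd \
  -o jsonpath='{.spec.numberOfReplicas}{"\n"}'
```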
m
> Deleting the engine on the draining node seems to fix it (but the VM gets paused)
This could result in I/O errors. As far as I know, shutting down the VM is the safest option, if possible.
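If the guest can tolerate it, stopping the VM first avoids the I/O risk entirely. Roughly (VM name and namespace are placeholders, assuming virtctl is pointed at the Harvester cluster):

```bash
# Gracefully stop the VM with KubeVirt's CLI before the drain
virtctl stop my-vm -n my-namespace

# ...put the node into maintenance mode / update firmware...

# Start it again once the node is back
virtctl start my-vm -n my-namespace
```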
> Literally the Longhorn replicas showed 6 instead of 3 in the GUI, and it kept throwing an error stating that more than one engine was running.

Do you have the support bundle from when this happened?
b
No, I should have grabbed one.
I'll keep an eye out and make one if it happens again.
m
Based on your description, it seems the live migration failed on the Longhorn side.
b
Yep, I think so too, just not sure what caused it.