# harvester
s
Yesterday a Virtual Machine rebooted unexpectedly on a Harvester 1.4.3 deployment (16 hosts, HA setup). Luckily I had enough time to grab the events, but before I could look at them further they were deleted (hour retention). So I combed through the metrics (node_exporter, kubevirt) but couldn't really find anything. I also looked on the host and saw nothing apparent in the logs. Wondering if anyone has insight into how to determine a cause.
# kubectl get events -A | grep production
prod-k3s                 48m         Normal    SuccessfulCreate        virtualmachine/production-gw-1                              Started the virtual machine by creating the new virtual machine instance production-gw-1
prod-k3s                 48m         Warning   Stopped                 virtualmachineinstance/production-gw-1                      The VirtualMachineInstance crashed.
prod-k3s                 48m         Normal    Deleted                 virtualmachineinstance/production-gw-1                      Signaled Deletion
prod-k3s                 48m         Normal    SuccessfulDelete        virtualmachine/production-gw-1                              Stopped the virtual machine by deleting the virtual machine instance XXXXXXXXXXXXXXXXXXXXXXXXXXX
prod-k3s                 48m         Normal    SuccessfulDelete        virtualmachineinstance/production-gw-1                      Deleted PodDisruptionBudget kubevirt-disruption-budget-2n9t8
prod-k3s                 48m         Normal    SuccessfulDelete        virtualmachineinstance/production-gw-1                      Deleted virtual machine pod virt-launcher-production-gw-1-aaaa
prod-k3s                 48m         Normal    SuccessfulDelete        virtualmachineinstance/production-gw-1                      Deleted virtual machine pod virt-launcher-production-gw-1-bbbb
prod-k3s                 48m         Normal    SuccessfulCreate        virtualmachineinstance/production-gw-1                      Created virtual machine pod virt-launcher-production-gw-1-cccc
prod-k3s                 48m         Normal    Created                 virtualmachineinstance/production-gw-1                      VirtualMachineInstance defined.
prod-k3s                 48m         Normal    SuccessfulCreate        virtualmachineinstance/production-gw-1                      Created PodDisruptionBudget kubevirt-disruption-budget-xxxxxxxx/mu
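(A rough sketch of what can still be checked after the events age out, given the roughly one-hour event TTL; the pod name below comes from the events above:)

# Dump events to a file before they expire next time
kubectl get events -A --sort-by=.lastTimestamp > vmi-events.txt

# The replacement launcher pod sometimes carries hints (restarts, conditions, node placement)
kubectl -n prod-k3s describe pod virt-launcher-production-gw-1-cccc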
h
Perhaps look at the VM logs rather than the hypervisor; it's likely the VM itself crashed. I never see VM reboots caused by the hypervisor unless the host itself has a failure, but then you'd see the host with a different uptime.
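Something along these lines (a rough sketch; the pod name is taken from the events above and this assumes the guest has a serial console enabled):

# Guest-side logs usually show the real reason (kernel panic, watchdog, clean reboot, ...)
virtctl console production-gw-1 -n prod-k3s
kubectl -n prod-k3s logs virt-launcher-production-gw-1-cccc -c compute | tail -n 100

# If the host itself had failed, its boot time would be recent too (run on the Harvester host)
uptime -s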
b
Did it stop/start on the same node?
s
kube_pod_info is showing two VirtualMachineInstances with the same created_by_name running for months, and they are on different hosts. After the reboot there is only one; the new VMI is on one of those nodes.
Might be the result of my last upgrade. Seems like I have many VMs where two of these appear on the same date, but for other VMs it's not the case.
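Roughly what I'm looking at (a sketch; VM and pod names are from above, and the created_by_* labels come from kube-state-metrics):

# In Prometheus, kube_pod_info{created_by_kind="VirtualMachineInstance",created_by_name="production-gw-1"}
# should now return a single series. Cross-check against the API server:
kubectl -n prod-k3s get pods -o wide | grep virt-launcher-production-gw-1
kubectl -n prod-k3s get vmi production-gw-1 -o wide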
b
Just making sure that the evictionStrategy on the VM isn't set to something that might explain it?
I've had nodes go squirrely and evict a bunch of VMs, some of which rebooted instead of live-migrating, but it sounds like what you saw might have been different. It could be at the VM level, but you'd think the VM would relaunch on the same node if it crashed/rebooted at the OS level.
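Quick way to check (a sketch; the KubeVirt CR location is what a Harvester install normally uses, adjust if yours differs):

# Per-VM setting; empty means the cluster-wide default applies
kubectl -n prod-k3s get vm production-gw-1 -o jsonpath='{.spec.template.spec.evictionStrategy}{"\n"}'

# Cluster-wide default (Harvester usually sets LiveMigrate)
kubectl -n harvester-system get kubevirt kubevirt -o jsonpath='{.spec.configuration.evictionStrategy}{"\n"}'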
s
From the event log I figured it was Harvester detecting a VM crash
b
¯\_(ツ)_/¯
Could be OOM too.
Did you check alerts in Prometheus?
They're not always obvious.
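If it was an OOM kill it's often only visible on the host, something like (rough sketch; pod name from the events above):

# Kernel-level OOM kills land in the host journal, not the guest
journalctl -k --since "2 hours ago" | grep -iE 'out of memory|oom-kill|killed process'

# Compare the launcher pod's memory limit with what the guest was using around that time
kubectl -n prod-k3s describe pod virt-launcher-production-gw-1-cccc | grep -iA3 limits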
s
Ya, I didn't see any alerts; I've got some custom ones too. I checked the memory and it didn't look like a problem. It would be interesting to figure out why Harvester made that determination, but I figure the VM just became unresponsive.
b
"I checked the memory"
For the Pod? Host? Namespace? Project?
s
Using the kubevirt and node_exporter metrics for the VM
b
I've seen discrepancies between what the OOM killer sees and what the VM reports via SNMP, but YMMV. I'm kinda out of ideas for you other than something on the VM/kernel side, or solar flares.
s
Ya, I'm just going to document it and see if it happens again. Do you think anything is off with multiple VMI pods running for a single name? Or maybe it's just a cleanup issue from migration? I know Longhorn volumes often need to be manually removed.
b
I've seen lots of stale references from migrations and upgrades
Some of them seem benign, while others (if there are multiple engines running, or something like that) are more serious.
Or if crictl still has running references on a node that aren't in etcd.
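E.g., on a suspect node (a sketch; <node-name> is whatever node you're checking):

# What the container runtime still has running locally
crictl pods | grep virt-launcher

# What the API server thinks should be on that node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name> | grep virt-launcher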