# harvester
s
Yesterday a Virtual Machine rebooted unexpectedly on a Harvester 1.4.3 deployment (16 hosts, HA setup). Luckily I had enough time to grab the events, but before I could look at them further they were deleted (hour retention). So I combed through the metrics (node_exporter, kubevirt) but couldn't really find anything. I also looked on the host and saw nothing apparent in the logs. Wondering if anyone has insight into how to determine a cause.
# kubectl get events -A | grep production
prod-k3s                 48m         Normal    SuccessfulCreate        virtualmachine/production-gw-1                              Started the virtual machine by creating the new virtual machine instance production-gw-1
prod-k3s                 48m         Warning   Stopped                 virtualmachineinstance/production-gw-1                      The VirtualMachineInstance crashed.
prod-k3s                 48m         Normal    Deleted                 virtualmachineinstance/production-gw-1                      Signaled Deletion
prod-k3s                 48m         Normal    SuccessfulDelete        virtualmachine/production-gw-1                              Stopped the virtual machine by deleting the virtual machine instance XXXXXXXXXXXXXXXXXXXXXXXXXXX
prod-k3s                 48m         Normal    SuccessfulDelete        virtualmachineinstance/production-gw-1                      Deleted PodDisruptionBudget kubevirt-disruption-budget-2n9t8
prod-k3s                 48m         Normal    SuccessfulDelete        virtualmachineinstance/production-gw-1                      Deleted virtual machine pod virt-launcher-production-gw-1-aaaa
prod-k3s                 48m         Normal    SuccessfulDelete        virtualmachineinstance/production-gw-1                      Deleted virtual machine pod virt-launcher-production-gw-1-bbbb
prod-k3s                 48m         Normal    SuccessfulCreate        virtualmachineinstance/production-gw-1                      Created virtual machine pod virt-launcher-production-gw-1-cccc
prod-k3s                 48m         Normal    Created                 virtualmachineinstance/production-gw-1                      VirtualMachineInstance defined.
prod-k3s                 48m         Normal    SuccessfulCreate        virtualmachineinstance/production-gw-1                      Created PodDisruptionBudget kubevirt-disruption-budget-xxxxxxxx/mu
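(A rough sketch of what can still be checked after the events age out, given the roughly one-hour event TTL; the pod name below comes from the events above:)

# Dump events to a file before they expire next time
kubectl get events -A --sort-by=.lastTimestamp > vmi-events.txt

# The replacement launcher pod sometimes carries hints (restarts, conditions, node placement)
kubectl -n prod-k3s describe pod virt-launcher-production-gw-1-cccc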
h
Perhaps look at the VM logs rather than the hypervisor; it's likely the VM itself crashed. I never see VM reboots caused by the hypervisor unless the host itself has a failure, but then you'd see the host with a different uptime.
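Something along these lines (a rough sketch; the pod name is taken from the events above and this assumes the guest has a serial console enabled):

# Guest-side logs usually show the real reason (kernel panic, watchdog, clean reboot, ...)
virtctl console production-gw-1 -n prod-k3s
kubectl -n prod-k3s logs virt-launcher-production-gw-1-cccc -c compute | tail -n 100

# If the host itself had failed, its boot time would be recent too (run on the Harvester host)
uptime -s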
b
Did it stop/start on the same node?
s
kube_pod_info is showing two VirtualMachineInstances with the same created_by_name running for months, and they are on different hosts. After the reboot there is only one; the new VMI is on one of those nodes.
Might be the result of my last upgrade. Seems like I have many VMs where two of these appear on the same date, but for other VMs it's not the case.
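Roughly what I'm looking at (a sketch; VM and pod names are from above, and the created_by_* labels come from kube-state-metrics):

# In Prometheus, kube_pod_info{created_by_kind="VirtualMachineInstance",created_by_name="production-gw-1"}
# should now return a single series. Cross-check against the API server:
kubectl -n prod-k3s get pods -o wide | grep virt-launcher-production-gw-1
kubectl -n prod-k3s get vmi production-gw-1 -o wide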
b
Just making sure that the evictionStrategy on the VM isn't set to something that might explain it?
I've had nodes go squirrely and evict a bunch of VMs, some of which rebooted instead of live-migrating, but it sounds like what you saw might have been different. It could be at the VM level, but you'd think the VM would relaunch on the same node if it crashed/rebooted at the OS level.
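Quick way to check (a sketch; the KubeVirt CR location is what a Harvester install normally uses, adjust if yours differs):

# Per-VM setting; empty means the cluster-wide default applies
kubectl -n prod-k3s get vm production-gw-1 -o jsonpath='{.spec.template.spec.evictionStrategy}{"\n"}'

# Cluster-wide default (Harvester usually sets LiveMigrate)
kubectl -n harvester-system get kubevirt kubevirt -o jsonpath='{.spec.configuration.evictionStrategy}{"\n"}'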
s
From the event log I figured it was Harvester detecting a VM crash
b
¯\_(ツ)_/¯
Could be OOM too.
Did you check alerts in Prometheus?
They're not always obvious.
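If it was an OOM kill it's often only visible on the host, something like (rough sketch; pod name from the events above):

# Kernel-level OOM kills land in the host journal, not the guest
journalctl -k --since "2 hours ago" | grep -iE 'out of memory|oom-kill|killed process'

# Compare the launcher pod's memory limit with what the guest was using around that time
kubectl -n prod-k3s describe pod virt-launcher-production-gw-1-cccc | grep -iA3 limits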
s
Ya, I didn't see any alerts; I've got some custom ones too. I checked the memory and it didn't look like a problem. It would be interesting to figure out why Harvester made that determination, but I figure the VM just became unresponsive.
b
"I checked the memory"
For the Pod? Host? Namespace? Project?
s
Using the kubevirt and node_exporter metrics for the VM
b
I've seen discrepancies between what the OOM killer sees and what the VM reports via SNMP, but YMMV. I'm kinda out of ideas for you other than something on the VM/kernel side, or solar flares.
s
Ya, I'm just going to document it and see if it happens again. Do you think anything is off with multiple VMI pods running for a single name? Or maybe it's just a cleanup issue from migration? I know Longhorn volumes often need to be manually removed.
b
I've seen lots of stale references from migrations and upgrades
Some of them seem benign, while others (if there are multiple engines running, or something like that) are more serious.
Or if crictl still has running references on a node that aren't in etcd.
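E.g., on a suspect node (a sketch; <node-name> is whatever node you're checking):

# What the container runtime still has running locally
crictl pods | grep virt-launcher

# What the API server thinks should be on that node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name> | grep virt-launcher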