Hi Mark, it looks like a resource issue in the Longhorn layer, so the VM is paused while the volume is degraded. How many nodes do you have? Have you seen any relevant messages in the Longhorn manager log?
09/28/2022, 7:27 PM
Hi @quick-sandwich-76600. It's a 3-node cluster of NUC-like nodes; each has a 2TB NVMe and they are connected through 2.5GB links.
Obviously I mean 2.5Gb/s.
I’ll look through the longhorn manager logs to see if I can find the start of “an event”. There’s a lot of stuff in the logs but I can’t currently find the start of an issue, so I might have to wait for it to happen again.
10/01/2022, 4:24 PM
Hi @sticky-summer-13450 I do see the resizing in the logs, but it may be normal depending on what exactly happened. The main problem I see is that one of the volume managers is not answering and is timing out. There may be various reasons for that, but we need other logs to get a clearer picture. Could you please generate a support bundle (https://docs.harvesterhci.io/v0.3/troubleshooting/harvester/) and open a GitHub issue with the support bundle plus the info provided here?
I wonder if I should reboot all three nodes in the Harvester cluster, one after another, to see if that helps.
10/02/2022, 8:11 PM
A reboot may only help temporarily... My gut feeling (but as I said, I have never debugged a Harvester cluster) is that something is exhausting resources: I can see that etcd is having performance issues, with updates taking too long. That usually leads to general cluster issues.
10/03/2022, 8:25 AM
If you want more information on the cluster I can provide it, but for a home-lab cluster it seems quite over-the-top. Three nodes with AMD Ryzen 7 4700U (8 core 8 thread) each with 64GB RAM and 2TB NVMe, networked with 2.5Gb/s ethernet through a dedicated MikroTik CRS305 - which has the up-link to the rest of the network.
The workloads running are tiny - 3 agent nodes of a k3s cluster running my web sites, an mssql server with no clients and a nagios server for monitoring my home network. I have twice as much running on an old esxi host with a 10th of the CPU cores and a 12th of the RAM.
For what it's worth, I have rebooted all three nodes - one after the other (after migrating the running VMs, and cordoning and draining the nodes) and the situation is just as bad.
I'm starting to wonder whether I have a hardware issue instead of a software issue, or whether the Kernel in Harvester does not like some component in the PN50 nodes.
On reflection - since I was running Harvester 0.1.0 very successfully on one of the nodes, 0.3.0 very successfully on another of the nodes, and only started having issues more recently, if this does end up looking more like hardware then I'd lean towards this being a kernel compatibility issue rather than a physical hardware issue.
10/04/2022, 11:49 AM
I don't know. The HW seems ok, but something is making etcd get timeouts and delays...