This message was deleted Rancher Users #harvester


# harvester

adamant-kite-43734

09/27/2022, 11:39 AM

This message was deleted.

quick-sandwich-76600

09/27/2022, 11:45 PM

Hi Mark, it looks like a resource issue in the Longhorn layer so the VM is paused while the volume is degraded. How many nodes do you have? Have you been able to see any relevant message in the longhorn manager log?

sticky-summer-13450

09/28/2022, 7:27 PM

Hi @quick-sandwich-76600. It's a 3-node cluster of NUC-like nodes; each has a 2TB NVMe and they are connected through 2.5GB links.

sticky-summer-13450

09/28/2022, 8:01 PM

Obviously I mean 2.5Gb/s. I’ll look through the longhorn manager logs to see if I can find the start of “an event”. There’s a lot of stuff in the logs but I can’t currently find the start of an issue, so I might have to wait for it to happen again.

sticky-summer-13450

10/01/2022, 12:12 PM

It looks like an event in Harvester/Longhorn started this morning. Harvester has taken down two of the VMs, from around 8 in the morning. These are all the logs I can get with

kubectl logs longhorn-manager-<ID> --context=harvester003 --namespace=longhorn-system > longhorn-manager-p9wv<ID>
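Finding the start of an event in a dump like this can be sketched as a scan for the earliest warning or error line. The following is a minimal sketch only: it assumes the manager log uses logrus-style text lines (`time="..." level=... msg="..."`) with second-precision UTC timestamps, and the helper name `first_problem` and the sample lines are made up for illustration.

```python
import re
from datetime import datetime

# Assumed line shape (logrus text format); real logs may differ:
#   time="2022-10-01T08:01:02Z" level=error msg="..."
LINE_RE = re.compile(
    r'time="(?P<time>[^"]+)"\s+level=(?P<level>\w+)\s+msg="(?P<msg>(?:[^"\\]|\\.)*)"'
)


def first_problem(lines, levels=("warning", "error", "fatal")):
    """Return (timestamp, level, message) of the earliest matching line, or None."""
    earliest = None
    for line in lines:
        match = LINE_RE.search(line)
        if not match or match.group("level") not in levels:
            continue  # skip lines that do not parse or are below the wanted levels
        # Replace the trailing "Z" so fromisoformat accepts it on older Pythons.
        ts = datetime.fromisoformat(match.group("time").replace("Z", "+00:00"))
        if earliest is None or ts < earliest[0]:
            earliest = (ts, match.group("level"), match.group("msg"))
    return earliest


# Made-up sample lines, not taken from the actual cluster.
sample = [
    'time="2022-10-01T07:58:10Z" level=info msg="made-up info line"',
    'time="2022-10-01T08:01:02Z" level=error msg="made-up timeout line"',
    'time="2022-10-01T08:03:45Z" level=warning msg="made-up later warning"',
]
result = first_problem(sample)
print(result)
```

To run it against a real dump, pass an open file object for the file written by the redirect above instead of `sample`.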

quick-sandwich-76600

10/01/2022, 4:24 PM

Hi @sticky-summer-13450, I do see the resizing in the logs, but that can be normal depending on what exactly happened. The main problem I see is that one of the volume managers is not answering and is timing out. There may be various reasons for that, but we need other logs to get a clearer picture. Could you please generate a support bundle (https://docs.harvesterhci.io/v0.3/troubleshooting/harvester/) and open a GitHub issue with the support bundle plus the info provided here?

sticky-summer-13450

10/01/2022, 4:47 PM

Yep - https://github.com/longhorn/longhorn/issues/4650

👍 1

sticky-summer-13450

10/01/2022, 4:54 PM

This is what's happening to one volume, continually (animated gif).

sticky-summer-13450

10/02/2022, 7:40 AM

I wonder if I should reboot all three nodes in the Harvester cluster, one after another, to see if that helps.

quick-sandwich-76600

10/02/2022, 8:11 PM

A reboot may only help temporarily... My gut feeling (but as I said, I have never debugged a Harvester cluster) is that something is exhausting resources: I do see etcd having performance issues, with updates taking too long. That usually leads to general cluster issues.
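One way to quantify "updates taking too long" is to pull the slow-apply warnings out of the etcd log. This is a rough sketch under stated assumptions: it assumes etcd is writing JSON log lines and that slow applies show up as `"msg":"apply request took too long"` with a Go-style duration in the `took` field. The function names, the 0.5 s threshold, and the sample lines are hypothetical.

```python
import json
import re

# Seconds per Go duration unit.
_UNITS = {"ns": 1e-9, "µs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|µs|us|ms|s|m|h)")


def go_duration_seconds(text):
    """Convert a Go duration string such as '1.25s' or '1m30s' to seconds."""
    return sum(float(number) * _UNITS[unit] for number, unit in _PART.findall(text))


def slow_applies(lines, threshold_s=0.5):
    """Return (timestamp, seconds) for each slow-apply warning at or above the threshold."""
    slow = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON lines
        if not isinstance(entry, dict):
            continue
        if entry.get("msg") != "apply request took too long":
            continue
        took = go_duration_seconds(entry.get("took", ""))
        if took >= threshold_s:
            slow.append((entry.get("ts"), took))
    return slow


# Made-up sample lines, not taken from the actual cluster.
sample = [
    '{"level":"warn","ts":"2022-10-01T08:00:01.000Z","msg":"apply request took too long","took":"1.25s","expected-duration":"100ms"}',
    '{"level":"warn","ts":"2022-10-01T08:00:05.000Z","msg":"apply request took too long","took":"120ms","expected-duration":"100ms"}',
    "not a json line",
]
hits = slow_applies(sample)
print(hits)
```

Many hits clustered around the time the VMs were paused would support the resource-exhaustion theory; few or none would point elsewhere.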

sticky-summer-13450

10/03/2022, 8:25 AM

I see. If you want more information on the cluster I can provide it, but for a home-lab cluster it seems quite over-the-top. Three nodes with AMD Ryzen 7 4700U (8 core 8 thread), each with 64GB RAM and 2TB NVMe, networked with 2.5Gb/s ethernet through a dedicated MikroTik CRS305 - which has the up-link to the rest of the network. The workloads running are tiny - 3 agent nodes of a k3s cluster running my web sites, an mssql server with no clients and a nagios server for monitoring my home network. I have twice as much running on an old esxi host with a 10th of the CPU cores and a 12th of the RAM.
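For a sense of scale, the link speed matters mostly when a degraded volume has to be re-replicated across nodes. The arithmetic below is a back-of-the-envelope sketch, assuming a worst case where a full 2 TB is copied over one 2.5 Gb/s link with no protocol overhead and decimal units (1 TB = 10^12 bytes). Real rebuilds copy only the data the volumes actually hold, so this is an upper bound, not a measurement.

```python
def transfer_seconds(size_bytes, link_bits_per_s):
    """Ideal transfer time: total bits divided by link rate, ignoring overhead."""
    return size_bytes * 8 / link_bits_per_s


TB = 10**12            # decimal terabyte, in bytes
link_bps = 2.5e9       # 2.5 Gb/s (gigabits, not gigabytes)

link_megabytes_per_s = link_bps / 8 / 1e6   # throughput expressed in MB/s
full_copy_s = transfer_seconds(2 * TB, link_bps)

print(f"link throughput: {link_megabytes_per_s} MB/s")
print(f"full 2 TB copy:  {full_copy_s} s (~{full_copy_s / 3600:.2f} h)")
```

This also shows why the Gb/GB distinction is worth stating: a 2.5 GB/s link would be eight times faster than the 2.5 Gb/s one described here.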

sticky-summer-13450

10/04/2022, 10:49 AM

For what it's worth, I have rebooted all three nodes - one after the other (after migrating the running VMs, and cordoning and draining the nodes) and the situation is just as bad. I'm starting to wonder whether I have a hardware issue instead of a software issue, or whether the kernel in Harvester does not like some component in the PN50 nodes.

sticky-summer-13450

10/04/2022, 11:44 AM

On reflection - since I was running Harvester 0.1.0 very successfully on one of the nodes, 0.3.0 very successfully on another of the nodes, and only started having issues more recently, if this does end up looking more like hardware then I'd lean towards this being a kernel compatibility issue rather than a physical hardware issue.

quick-sandwich-76600

10/04/2022, 11:49 AM

I don't know. The HW seems ok, but something is making etcd get timeouts and delays...
