# harvester
a
p
I can't even find this option for Linux VMs. Is this only available for Windows VMs?
I wasn't even able to run pg_restore anymore. There is one table with 30GB, and as soon as it does any larger operation on that table, the VM hard-crashes. I needed to restore from a Longhorn checkpoint. Please help!
As a side note: I have had disk I/O issues with Harvester from the beginning. The server is a DL360 Gen9 with a P440ar and 2x Samsung 870QV 4TB in RAID1. With both controller cache and disk cache disabled, I see only 2 MB/s (!) sequential write. With disk cache enabled it's about 150 MB/s. I can't enable the controller cache due to the lack of a backup battery, but even then the write performance was too low. Even without a controller cache, two SATA SSDs in RAID1 shouldn't yield just 150 MB/s sequential write; it should be something like 500 MB/s. Do I need to install a separate driver or anything to get this working properly?
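(For anyone wanting to reproduce those numbers outside a VM, here is a minimal sketch of a sequential-write test run directly on the node; fio/dd availability is assumed and the target path is a placeholder for a directory on the RAID1 array.)

```sh
# Sequential-write sanity check with fio (assumes fio is installed; the path below is a
# placeholder on the RAID1 volume). O_DIRECT bypasses the page cache so the result
# reflects the controller/SSDs rather than RAM.
fio --name=seqwrite --rw=write --bs=1M --size=4G --numjobs=1 \
    --ioengine=libaio --direct=1 --filename=/path/on/raid1/fio-test.bin

# Rough equivalent without fio:
dd if=/dev/zero of=/path/on/raid1/dd-test.bin bs=1M count=4096 oflag=direct status=progress
```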
a
You mentioned you can't find the option for setting reserved memory for VMs. What do you see if you go to edit the VM's config and select the "Advanced Options" tab on the left-hand side? On that screen, under the "Run Strategy" box, there should be a blue "Show More" link, which you can click to access the Reserved Memory box.
p
Thank you. It was hidden under a small "show more" button. I've configured it and also set it as an annotation for the nodes in Rancher. I've set it to 256MB. The crashes still occur, though.
I managed to get it working once by force-enabling the RAID controller cache (the server is on a UPS). However, it still doesn't work consistently. Either the SSDs are literal e-waste or my RAID controller can't handle their multiple levels of cache. I've been running similar consumer SSDs, Samsung 860 EVO, on another HP server for 5 years without a single issue and with great performance. This problem is super hard to debug. I might just get rid of the SSDs and get new ones. 600 bucks down the drain...
The crash occurred during normal operation just now...
a
that's maybe not exactly the same situation, as I assume in your case it's not rebuilding whole replicas of volumes, but I can imagine a lot of writes having a similar effect
p
I don't think so. I didn't see anything about rebuilding. I only have a single node cluster for now, while migrating all workloads over from Hyper-V.
Yesterday I saw that the Ubuntu version (20.04) was no longer supported with my Rancher and RKE2 versions, so I replaced it with Leap 15.5. Maybe this will also help. Over the next few days, I'll also replace the SSDs. If it's still failing after that, I'll have to assume it's a bug in Harvester. Even though the storage is slow and the I/O load is high, this just shouldn't happen.
I had a look in the alert manager. The time matches exactly with the last crash:
It didn't fire before though, at least there was no email. All email alerts arrive as InfoInhibitor alerts for some reason...
This alert has appeared only after I increased the reserved memory. I removed it yesterday while updating the node OS. It almost seems like increasing the reserved memory created yet another issue.
Here is a support bundle regarding the original problem, from before I experimented with reserved memory. The last crash appeared around 2024-03-12T02:49. You can find the timestamps of the other crashes by searching for "Stopping VM with VMI in phase Failed". The virt-controller detects that the VM is in a failed state, but I wasn't able to find the reason for that failed state in the logs. I would appreciate it if somebody else could have a look.
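(For reference, a sketch of how those events can be located once the bundle is extracted; the directory name is a placeholder, and the per-node zips inside the bundle need to be unzipped first.)

```sh
# Search the extracted support bundle (including the unzipped nodes/<host>.zip contents)
# for the failure marker quoted above; each hit is printed with its file and line number.
grep -rn "Stopping VM with VMI in phase Failed" ./supportbundle-extracted/
```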
a
Sorry I wasn't able to get to this yesterday. I've had a look at the support bundle, and the crash at 2024-03-12T02:49 coincides with the OOM killer kicking in (see kernel.log inside nodes/ha01.zip in the support bundle):
Mar 12 02:49:13 ha01 kernel: memory: usage 17073492kB, limit 17073492kB, failcnt 6151604
- looks like something is capping out at 16GiB, if I'm reading that correctly?
how much reserved memory did you set subsequently?
can you try setting it higher and seeing if you still have problems?
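(Quick arithmetic on that kernel line, nothing Harvester-specific: the reported limit works out to just over 16 GiB.)

```sh
# Convert the OOM killer's usage/limit figure from kB to GiB:
awk 'BEGIN { printf "%.2f GiB\n", 17073492 / 1024 / 1024 }'   # ≈ 16.28 GiB, i.e. the 16 GiB VM plus roughly 290 MiB of overhead
```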
p
That's interesting. The nodes are indeed set to 16 GB each. I was monitoring the memory usage with kubectl top nodes inside the guest cluster and it was never above 60%. However, I've noticed several times that Harvester reports higher memory usage, often around 95%. Is that because Harvester can't see what's reserved versus what's actually used inside the guest? I set the reserved memory annotation to 256MB, but it actually decreased stability: two crashes occurred during normal operation after that. I wonder what the default value is so I can use it as a baseline.
Memory usage reported by Harvester (host view):
Memory usage reported by kubectl (guest view):
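(Not the screenshots above, but a sketch of how the two views can be compared from the CLI; the kubeconfig context names are placeholders, and kubectl top needs metrics to be available in each cluster.)

```sh
# Guest view: what the RKE2 cluster's metrics report for its (virtual) nodes.
kubectl --context guest-cluster top nodes

# Host view: what the Harvester cluster reports for the physical node, plus the node's own numbers.
kubectl --context harvester top nodes
free -h   # run directly on the Harvester node; buff/cache is listed separately here
```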
a
AIUI harvester can't look inside the guests
πŸ‘ 1
so the stats from inside the guests are just what the guest imagines is happening inside the VM
TBH I haven't looked at the stats stuff in great detail myself yet
I would suggest trying a larger amount of reserved RAM - go from 256MB to 512MB or even 1GB, just as a test, and see if the behaviour changes
I suspect there's some weirdness between the amount of RAM you want for the guest plus the extra needed by qemu/libvirt/kubevirt, with the combined total hitting a limit somewhere
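(If the UI gets tedious, a sketch of doing the same from the CLI; this assumes the harvesterhci.io/reservedMemory VM annotation that the reserved-memory field maps to, so verify the exact key against your Harvester version, and the VM name/namespace are placeholders.)

```sh
# Bump reserved memory on the VM object directly (annotation key assumed from the Harvester docs;
# double-check it for your version). The VM likely needs a restart for the new value to apply.
kubectl annotate vm my-vm -n default harvesterhci.io/reservedMemory=512Mi --overwrite
```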
p
I wonder if reserved memory is meant to cover the total overhead described in https://docs.harvesterhci.io/v1.2/rancher/resource-quota/#overhead-memory-of-virtual-machine - if so, then 256MB is actually lower than the default.
That would make sense. So I guess there is a total memory limit, which is the actual VM memory plus the additional overhead memory.
πŸ‘ 1
a
ah! I somehow hadn't found that bit of docs yet myself πŸ˜•
p
Took me a while to find it πŸ˜„
I'll upgrade the storage today and keep monitoring the situation. So far it has been stable during normal usage. If I see it crashing due to OOM again, I'll set the reserved memory to 512MB and keep monitoring.
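(Something cheap to keep running on the Harvester node while monitoring; a sketch.)

```sh
# Follow the host kernel log and flag further OOM kills as they happen.
journalctl -k -f | grep -i --line-buffered "out of memory\|oom-kill"
```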
a
cool, I'll be very interested to hear how you go
meanwhile, I have some reading to do about kubevirt memory limits and requests πŸ™‚
anyway, hopefully we're on the right track
p
Let's hope so and thank you for your help!
a
you're welcome!
p
I looked at the stats some more and noticed unusually high memory consumption on the host. I have four VMs provisioned with 16GB each, which makes 64GB (plus some small overhead). The host shows 87GB of used memory. That's an overhead of 23GB just to run Harvester.
a
Interesting. I've got a three-node Harvester cluster running here, itself in VMs (i.e. just for test/dev, not production use), and Harvester reports only about 7GB used per host
p
Looking at top on the node, it shows 91.2GB of used memory, with 17.7GB of that being cache. That would leave 73.5GB of actually used memory, which is still a very high overhead (roughly 9GB) but more like what I would expect. It's also weird that Harvester reports only 126GB of total memory in the UI.
For comparison, my Hyper-V host has an overhead of 5GB.
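(For the cache point, a sketch of the numbers worth checking on the node itself; recent versions of free already exclude buff/cache from "used" and report "available" separately.)

```sh
# On the Harvester node: "available" is the practical headroom figure, and buff/cache is
# reclaimable, so it shouldn't really be counted as overhead.
free -h
grep -E 'MemTotal|MemAvailable|Cached' /proc/meminfo
```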
a
Just thinking aloud, I wonder if 126 vs 128 is rounding errors (/1024 vs /1000)
also if harvester is including cache in memory used, maybe it shouldn't do that as it's a bit misleading
πŸ‘ 1
I'll have to pick this up again next week, but might be worth opening a github issue or two
f
Saw this thread just now... In my case, I don't have high load at all. The guest OS is simply using more than the cgroup limit... I don't really get why KubeVirt allows that in the first place.
p
But you also run into OOM like me. How's the memory usage inside the guests looking?