# harvester
a
p
I can't even find this option for Linux VMs. Is this only available for Windows VMs?
I wasn't even able to run pg_restore anymore. There is one table with 30GB, and as soon as it does any larger operation on that table, the VM hard-crashes. I needed to restore from a Longhorn checkpoint. Please help!
As a side note: I have had disk I/O issues with Harvester from the beginning. The server is a DL360 Gen9 with a P440ar and 2x Samsung 870QV 4TB in RAID1. With both controller cache and disk cache disabled, I see only 2 MB/s (!) sequential write. With disk cache enabled it's about 150 MB/s. I can't enable the controller cache due to the lack of a backup battery, but even then the write performance was too low. Even without a controller cache, two SATA SSDs in RAID1 shouldn't yield just 150 MB/s sequential write; it should be something like 500 MB/s. Do I need to install a separate driver or anything to get this working properly?
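(For anyone wanting to reproduce those numbers outside a VM, here is a minimal sketch of a sequential-write test run directly on the node; fio/dd availability is assumed and the target path is a placeholder for a directory on the RAID1 array.)

```sh
# Sequential-write sanity check with fio (assumes fio is installed; the path below is a
# placeholder on the RAID1 volume). O_DIRECT bypasses the page cache so the result
# reflects the controller/SSDs rather than RAM.
fio --name=seqwrite --rw=write --bs=1M --size=4G --numjobs=1 \
    --ioengine=libaio --direct=1 --filename=/path/on/raid1/fio-test.bin

# Rough equivalent without fio:
dd if=/dev/zero of=/path/on/raid1/dd-test.bin bs=1M count=4096 oflag=direct status=progress
```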
a
You mentioned you can't find the option for setting reserved memory for VMs. What do you see if you go to edit the VM's config and select the "Advanced Options" tab on the left-hand side? On that screen, under the "Run Strategy" box, there should be a blue "Show More" link, which you can click to access the Reserved Memory box.
p
Thank you. It was hidden under a small "show more" button. I've configured it and also set it as an annotation for the nodes in Rancher. I've set it to 256MB. The crashes still occur, though.
I managed to get it working once by force-enabling the RAID controller cache (the server is on a UPS). However, it still doesn't work consistently. Either the SSDs are literal e-waste or my RAID controller can't handle their multiple levels of cache. I've been running similar consumer SSDs, Samsung 860 EVO, on another HP server for 5 years without a single issue and with great performance. This problem is super hard to debug. I might just get rid of the SSDs and get new ones. 600 bucks down the drain...
The crash occurred during normal operation just now...
a
that's maybe not exactly the same situation, as I assume in your case it's not rebuilding whole replicas of volumes, but I can imagine a lot of writes having a similar effect
p
I don't think so. I didn't see anything about rebuilding. I only have a single node cluster for now, while migrating all workloads over from Hyper-V.
Yesterday I saw that the Ubuntu version (20.04) was no longer supported with my Rancher and RKE2 versions, so I replaced it with Leap 15.5. Maybe this will also help. Over the next few days, I'll also replace the SSDs. If it's still failing after that, I'll have to assume it's a bug in Harvester. Even though the storage is slow and the I/O load is high, this just shouldn't happen.
I had a look in the alert manager. The time matches exactly with the last crash:
It didn't fire before though, at least there was no email. All email alerts arrive as InfoInhibitor alerts for some reason...
This alert has appeared only after I increased the reserved memory. I removed it yesterday while updating the node OS. It almost seems like increasing the reserved memory created yet another issue.
Here is a support bundle regarding the original problem, from before I experimented with reserved memory. The last crash appeared around 2024-03-12T02:49. You can find the timestamps of the other crashes by searching for "Stopping VM with VMI in phase Failed". The virt-controller detects that the VM is in a failed state, but I wasn't able to find the reason for that failed state in the logs. I would appreciate it if somebody else could have a look.
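(For reference, a sketch of how those events can be located once the bundle is extracted; the directory name is a placeholder, and the per-node zips inside the bundle need to be unzipped first.)

```sh
# Search the extracted support bundle (including the unzipped nodes/<host>.zip contents)
# for the failure marker quoted above; each hit is printed with its file and line number.
grep -rn "Stopping VM with VMI in phase Failed" ./supportbundle-extracted/
```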
a
Sorry I wasn't able to get to this yesterday. I've had a look at the support bundle, and the crash at 2024-03-12T02:49 coincides with the OOM killer kicking in (see kernel.log inside nodes/ha01.zip in the support bundle):
Mar 12 02:49:13 ha01 kernel: memory: usage 17073492kB, limit 17073492kB, failcnt 6151604
- looks like something is capping out at 16GiB, if I'm reading that correctly?
how much reserved memory did you set subsequently?
can you try setting it higher and seeing if you still have problems?
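(Quick arithmetic on that kernel line, nothing Harvester-specific: the reported limit works out to just over 16 GiB.)

```sh
# Convert the OOM killer's usage/limit figure from kB to GiB:
awk 'BEGIN { printf "%.2f GiB\n", 17073492 / 1024 / 1024 }'   # ≈ 16.28 GiB, i.e. the 16 GiB VM plus roughly 290 MiB of overhead
```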
p
That's interesting. The nodes are indeed set to 16 GB each. I was monitoring the memory usage with kubectl top nodes inside the guest cluster and it was never above 60%. However, I've noticed several times that Harvester reports higher memory usage, often around 95%. Is that because Harvester can't see what's reserved versus what's actually used inside the guest? I set the reserved memory annotation to 256MB, but it actually decreased stability: two crashes occurred during normal operation after that. I wonder what the default value is so I can use it as a baseline.
Memory usage reported by Harvester (host view):
Memory usage reported by kubectl (guest view):
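(Not the screenshots above, but a sketch of how the two views can be compared from the CLI; the kubeconfig context names are placeholders, and kubectl top needs metrics to be available in each cluster.)

```sh
# Guest view: what the RKE2 cluster's metrics report for its (virtual) nodes.
kubectl --context guest-cluster top nodes

# Host view: what the Harvester cluster reports for the physical node, plus the node's own numbers.
kubectl --context harvester top nodes
free -h   # run directly on the Harvester node; buff/cache is listed separately here
```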
a
AIUI harvester can't look inside the guests
πŸ‘ 1
so the stats from inside the guests are just what the guest imagines is happening inside the VM
TBH I haven't looked at the stats stuff in great detail myself yet
I would suggest trying a larger amount of reserved RAM - go from 256MB to 512MB or even 1GB, just as a test, and see if the behaviour changes
I suspect there's some weirdness between the amount of RAM you want for the guest plus the extra needed by qemu/libvirt/kubevirt, with the combined total hitting a limit somewhere
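(If the UI gets tedious, a sketch of doing the same from the CLI; this assumes the harvesterhci.io/reservedMemory VM annotation that the reserved-memory field maps to, so verify the exact key against your Harvester version, and the VM name/namespace are placeholders.)

```sh
# Bump reserved memory on the VM object directly (annotation key assumed from the Harvester docs;
# double-check it for your version). The VM likely needs a restart for the new value to apply.
kubectl annotate vm my-vm -n default harvesterhci.io/reservedMemory=512Mi --overwrite
```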
p
I wonder if reserved memory is meant to cover the total overhead described in https://docs.harvesterhci.io/v1.2/rancher/resource-quota/#overhead-memory-of-virtual-machine - if so, then 256MB is actually lower than the default.
That would make sense. So I guess there is a total memory limit, which is the actual VM memory plus the additional overhead memory.
πŸ‘ 1
a
ah! I somehow hadn't found that bit of docs yet myself πŸ˜•
p
Took me a while to find it πŸ˜„
I'll upgrade the storage today and keep monitoring the situation. So far it has been stable during normal usage. If I see it crashing due to OOM again, I'll set the reserved memory to 512MB and keep monitoring.
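(Something cheap to keep running on the Harvester node while monitoring; a sketch.)

```sh
# Follow the host kernel log and flag further OOM kills as they happen.
journalctl -k -f | grep -i --line-buffered "out of memory\|oom-kill"
```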
a
cool, I'll be very interested to hear how you go
meanwhile, I have some reading to do about kubevirt memory limits and requests πŸ™‚
anyway, hopefully we're on the right track
p
Let's hope so and thank you for your help!
a
you're welcome!
p
I looked at the stats some more and noticed unusually high memory consumption on the host. I have four VMs provisioned with 16GB each, which makes 64GB (plus some small overhead). The host shows 87GB of used memory. That's an overhead of 23GB just to run Harvester.
a
Interesting. I've got a three-node Harvester cluster running here, itself in VMs (i.e. just for test/dev, not production use), and Harvester reports only about 7GB used per host
p
Looking at top on the node, it shows 91.2GB of used memory, with 17.7GB of that being cache. That would leave 73.5GB of actually used memory, which is still a very high overhead (roughly 9GB) but more like what I would expect. It's also weird that Harvester reports only 126GB of total memory in the UI.
For comparison, my Hyper-V host has an overhead of 5GB.
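(For the cache point, a sketch of the numbers worth checking on the node itself; recent versions of free already exclude buff/cache from "used" and report "available" separately.)

```sh
# On the Harvester node: "available" is the practical headroom figure, and buff/cache is
# reclaimable, so it shouldn't really be counted as overhead.
free -h
grep -E 'MemTotal|MemAvailable|Cached' /proc/meminfo
```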
a
Just thinking aloud, I wonder if 126 vs 128 is rounding errors (/1024 vs /1000)
also if harvester is including cache in memory used, maybe it shouldn't do that as it's a bit misleading
πŸ‘ 1
I'll have to pick this up again next week, but might be worth opening a github issue or two
f
Saw this thread just now... In my case, I don't have high load at all. The guest OS is simply using more than the cgroup limit... I don't really get why KubeVirt allows that in the first place.
p
But you also run into OOM like me. How's the memory usage inside the guests looking?