# harvester
p
I noticed a VM restarted and there was nothing in the rke2-server logs. When I checked dmesg though, I found an OOM error
[11467358.262783] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-f9d2ff8b835dbf910c5f761bd7bdae95ce7685efbd07876b95556d1ea16d78ed.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda6998669_cc33_4fd5_a34d_0368396d3afe.slice/cri-containerd-f9d2ff8b835dbf910c5f761bd7bdae95ce7685efbd07876b95556d1ea16d78ed.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda6998669_cc33_4fd5_a34d_0368396d3afe.slice/cri-containerd-f9d2ff8b835dbf910c5f761bd7bdae95ce7685efbd07876b95556d1ea16d78ed.scope,task=qemu-system-x86,pid=10476,uid=107

[11467358.262870] Memory cgroup out of memory: Killed process 10476 (qemu-system-x86) total-vm:51480644kB, anon-rss:49988500kB, file-rss:5532kB, shmem-rss:4kB, UID:107 pgtables:98416kB oom_score_adj:897
The VM has 12 cores, 48GB RAM
harvesterhci.io/reservedMemory: 512Mi
I added this. Given the huge amount of RAM (it's an Oracle DB), should I increase this value? The docs only mention going up to 512Mi if 256Mi still causes issues
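For reference, this is roughly how I check that the annotation actually landed on the VM object (the VM name and namespace here are just placeholders for my setup):
# hypothetical VM name/namespace; prints the current reservedMemory annotation
kubectl -n default get vm oracle-db \
  -o jsonpath="{.metadata.annotations['harvesterhci\.io/reservedMemory']}"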
The VM uses some 20GB at "idle" so it's not like the guest OS ran out of memory either
It happened at like 10pm, maybe coinciding with high disk activity from backups going on at that time?
Though, from the graphs, there's nothing extraordinary happening (2:30 on the graphs is when the VM crash happened)
Given it's an OOM kill, I think reservedMemory is likely the relevant setting, even though the docs only go as far as suggesting 512MB
A VM that is configured to have 1 CPU, 64 Gi Memory, 1 Volume and 1 NIC will get around 250 Mi Memory Overhead when the ratio is "1.0". The VM memory size does not have a big influence on the computing of Memory Overhead. The overhead of guest OS pagetables needs one bit for every 512b of RAM size.
So if VM memory size doesn't have a big influence, what does? CPU count? The number and size of attached volumes? IO activity? My hypothesis is that it's still a reservedMemory thing, but I'll wait for a second opinion
e
If you see VMs getting OOM-killed but the node isn't under memory pressure at any point, then it's highly likely the problem is too little reserved memory. If the guest OS runs out of memory, it won't show up in the node's logs anywhere, because that situation is largely invisible to the hypervisor.

It's pretty hard to make an accurate estimate of how much memory needs to be reserved for the qemu and virt-launcher overhead, because it seems to depend on lots of factors and also seems to vary with the workload inside the VM. The current approach is that it's just a fixed amount that is likely to work for most use cases. It's skewed a bit towards the lean side to avoid being too wasteful for small VMs, but as a result very large VMs tend to run out of memory.

If you can afford it, increase the reserved memory for that VM to 1Gi. Having "too much" reserved memory won't cause any problems.
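If you prefer the CLI over the UI, something along these lines should set it (VM name and namespace are placeholders; the VM needs a restart afterwards for the new reservation to apply):
# hypothetical VM name/namespace; bumps the reservation to 1Gi
kubectl -n default annotate vm oracle-db \
  harvesterhci.io/reservedMemory=1Gi --overwrite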
p
Alright, awesome! Thank you a ton. I will set it to 2GB, as RAM is something I have plenty of
I was going to do that, and I am grateful for you supporting my idea 😄
e
Here are some docs on KubeVirt's memory pressure behavior that may be interesting for you: https://kubevirt.io/user-guide/compute/node_overcommit/#configuring-the-memory-pressure-behavior-of-nodes
🐿️ 1
p
Ah, thank you
Though I don't think this is caused by the hosts being OOM
The Windows VM itself probably didn't run out of memory, because otherwise I would not have had an OOM error in dmesg on the host. Another thing is that this happened in the middle of the night, when there was absolutely no load on the VM (other than Oracle just spinning). All of this points me to the reserved memory value. Still, I was curious: could this be a symptom of heavy disk IO? I wouldn't think so, and the graph does not suggest high IO, though I am always cautious since these are hard disks. But at 2:30 in the morning there's literally nothing else running on the cluster other than backups.
b
Check https://docs.harvesterhci.io/v1.6/advanced/index/#additional-guest-memory-overhead-ratio . The default of 1.5 is far too low. We saw OOM kills even with 2.5. Now trying 4.5...
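Roughly how we look at and change it from the CLI, assuming the setting object follows the usual Harvester Settings layout:
# show the currently configured value (empty means the default of 1.5 is in effect)
kubectl get settings.harvesterhci.io additional-guest-memory-overhead-ratio \
  -o jsonpath='{.value}'
# raise it, e.g. to the 4.5 we are trying now
kubectl patch settings.harvesterhci.io additional-guest-memory-overhead-ratio \
  --type merge -p '{"value":"4.5"}'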
πŸ‘€ 1
p
Oh wow, okay
The only VMs I've seen this on so far are the two with 48G of RAM each
b
AFAIK it's not the VM size but the "level" of I/O. E.g. lots of disk and network traffic, number of devices for buffers etc.
p
Ah now that's interesting. Coincidentally, the big VMs also do the most IO (like in this case, an Oracle DB which is allergic to uptime)
I wonder if it's more relevant in my case, given I'm doing the very-not-recommended thing of running on hard disks and not SSDs
b
Maybe. Maybe not. We see this also with NVMes as backing storage. The good thing: you can correct this on the fly, simply by changing the overhead value and doing a live migration to another node.
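If you have virtctl handy, the migration part is a one-liner (VM name and namespace are placeholders):
# hypothetical VM name; live-migrates the VM so the new pod limit takes effect
virtctl migrate oracle-db -n default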
e
Oracle DB which is allergic to uptime
This is gold, thanks for the laugh 🤣

As I understand it, virtual machines will not consume memory unless it is written to. This means that allocating 48Gi of memory to a VM does not immediately allocate all of the 48Gi, but rather in chunks over time, as the workload within the VM dirties the memory pages. The hypervisor can easily determine if a memory page has never been written to. Therefore on startup the hypervisor (i.e. the qemu process) only needs as much memory as is actually being used in the VM, since the clean memory pages don't yet need to be allocated. Over time the hypervisor will then increase its memory footprint as more and more memory pages in the VM get dirtied.

However, the hypervisor can not (by itself) see when memory pages in the VM are freed up again. Therefore its memory footprint grows until it holds all of the VM's memory - plus its own overhead, of course. This means that a VM with a workload that doesn't fully utilize the entire memory capacity of the VM will probably never see OOMs, but a VM with the same resources allocated may see OOMs if the workload in the VM churns memory a lot [1].

The remedy to that is memory ballooning, where a special virtual device is installed in the VM and the guest OS has a special driver for this device. The memory ballooning driver will allocate free memory inside the virtual machine and communicate via the virtual ballooning device to the hypervisor that this memory is free and it's ok to release it. Otherwise the hypervisor can't ever shrink the actual memory footprint of a VM. QEMU and libvirt support memory ballooning, however to my knowledge KubeVirt and Harvester don't.

Therefore it's necessary that the virt-launcher Pod, which contains the qemu process that is the VM, has sufficient memory resources assigned to cover the entirety of the VM itself plus the overhead, because otherwise it will eventually run out of memory. Now here's the problem: it's pretty hard to determine what the overhead is as well, because this depends on the VM's configuration. It's quite reasonable to assume that more virtual devices require more memory overhead in the hypervisor, but how much exactly is not obvious.

I'd assume that it doesn't take long at all until all 48Gi of memory of your VM have been dirtied, if you're doing lots of IO. Once that's the case, the qemu process will hold 48Gi of memory. Then only the reserved memory remains for the hypervisor process and its helpers themselves (i.e. for qemu, virt-launcher...). Note that the reserved memory setting is only applied if the additional-guest-memory-overhead-ratio is 0, otherwise the overhead ratio will be used.

[1]: Lots of disk IO usually means lots of memory churn as well, because the guest OS kernel will utilize that memory for filesystem caching. Here is an example from my workstation:
β”‚ ~ β”‚β–Ί free -m
               total        used        free      shared  buff/cache   available
Mem:           63987       12854        6338        1695       47207       51133
You can see that the free output indicates only ~12Gi are actually allocated, however including buffers and caches there are almost 60Gi "in use". Now my OS could free up those buffers and caches if required, hence there are still 51Gi "available". But since all of that memory is dirty pages, if this was a virtual machine, the hypervisor would need to hold all of it in memory, since it can't tell which pages belong to the 12Gi in use and which are just cache.
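On the Harvester side you can watch the same effect by comparing the virt-launcher pod's memory limit with what it actually consumes (the pod name below is a placeholder, and kubectl top needs metrics to be available):
# memory limit of the compute container that runs qemu (hypothetical pod name)
kubectl -n default get pod virt-launcher-oracle-db-abcde \
  -o jsonpath='{.spec.containers[?(@.name=="compute")].resources.limits.memory}'
# actual memory usage of that pod
kubectl top pod virt-launcher-oracle-db-abcde -n default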
b
This is not about memory used INSIDE the VM! It's about the memory outside of the VM, but inside the pod, for virt-launcher, qemu, buffers etc. You can be lucky if the VM is not using all of its assigned memory and not see OOM kills for a long time. But if the VM starts to use its configured memory almost completely, then the additional overhead is what saves your VM's life - otherwise you run straight into an OOM kill.
p
Aha, I get it!
That makes sense
Just one thing: you said that the reserved memory setting is only applied if the additional guest memory overhead ratio is 0. But the Harvester docs (at least 1.4.0) say this: "Harvester adds a Reserved Memory field for users to adjust the guest OS memory and the final Total Memory Overhead. A proper Total Memory Overhead can help the VM to eliminate the chance of hitting OOM. The Total Memory Overhead = automatically computed Memory Overhead + Harvester Reserved Memory." https://docs.harvesterhci.io/v1.4/vm/index/#reserved-memory
Oh but checking your link says that's only the case for 1.3.0 and earlier
b
No, I didn't say that 🙂 As you see in the table, the reserved memory is subtracted from the VM memory (not touching the pod memory limit), but the additional overhead memory is added to the pod limit (not touching the VM memory size). I guess the reason for this is that the additional overhead ratio is a Harvester-only config; only reserved memory is available in upstream KubeVirt. Which means we have two knobs to turn. Ask SUSE why there are two ways of doing the same thing.
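You can see the difference on the objects themselves (names below are placeholders): reserved memory shrinks the guest memory reported on the VMI, while the overhead ratio only inflates the virt-launcher pod's limit:
# guest memory as seen by the VM (hypothetical VMI name)
kubectl -n default get vmi oracle-db \
  -o jsonpath='{.spec.domain.memory.guest}'
# memory limit of the pod wrapping it (hypothetical pod name)
kubectl -n default get pod virt-launcher-oracle-db-abcde \
  -o jsonpath='{.spec.containers[?(@.name=="compute")].resources.limits.memory}'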
p
TwT
I set the now-legacy reserved memory setting to 2GB. So that should have the effect I'm looking for, right?
b
I guess so, but nobody can tell you for sure. You'll see that the actual VM gets 2GB less memory, and you need to restart the VM to put that into place. If you set the overhead memory to a higher value instead, the VM memory size does not change, only the pod memory limit does, and you can "activate" this on the fly by doing a live migration to another Harvester node... if downtime matters. I prefer the overhead setting because its visibility to users is better: a VM configured with e.g. 4GB shows 4GB both inside the VM and in the Harvester UI. The values differ if you're using reserved memory. I think this is confusing.
p
Right, and you said you're on an overhead factor of 4 right now. Will set that as well
b
You need to find the "right" value yourself. We're also still looking for a value which (really) never OOM-kills VMs. Good luck! 🙂
p
Thank you! 🫡
πŸ‘ 1
e
This is not about memory used INSIDE the VM! It's about the memory outside of the VM, but inside the pod, for virt-launcher, qemu, buffers etc. You can be lucky if the VM is not using all of its assigned memory and not see OOM kills for a long time. But if the VM starts to use its configured memory almost completely, then the additional overhead is what saves your VM's life - otherwise you run straight into an OOM kill.
The memory inside the VM influences the memory consumption of the hypervisor, because if a memory page is never dirtied, the hypervisor never needs to allocate it. But if a memory page is dirtied even once, the hypervisor needs to keep it allocated until the VM is restarted (unless something like memory ballooning is used). Therefore the memory behavior inside the VM indirectly influences the memory behavior of the Pod, and so has an influence on whether or not you see OOMs when there is too little overhead accounted for. The cumulative memory consumption inside the VM is a lower bound for the memory requirement of the Pod.
Ask SUSE why there are two ways of doing the same thing.
Because it's not the same thing. When you define a VM in Harvester, you give one number for how much memory it should have. If you consider that number to include the hypervisor overhead, then the VM doesn't get as much memory as you assigned; some people consider this unacceptable. Conversely, if you consider that number to be the guaranteed amount of memory the VM is going to get, i.e. to exclude the hypervisor overhead, then the hypervisor overhead is going to push the total memory consumption of that VM above the specified number; some people consider this unacceptable as well. By giving two options, you have the choice: do you care about never exceeding the amount of memory you specify for a VM, or do you care about guaranteeing the amount of memory you specify? It's up to you to look at your use case and figure out which one is better for you.
b
In the end it's the same thing. Increasing the pod limit gives more overhead memory by adding it on top of the VM memory; reserved memory adds overhead memory by taking it away from the VM. From the namespace perspective it doesn't matter. If memory (used memory of your VM + overhead memory for VM management) runs out, the pod, and therefore your VM, is OOM-killed.