# rke2
b
No, that’s because what you’re looking at are different things. Capacity is a node's overall capacity in memory and CPU. Allocatable is what you can generally allocate to pod resources after system overhead is taken into account. In other words, if some pods have reserved memory and CPU, that is not reflected in these fields. The reason they are exactly the same in your situation is that you made all of the node's resources allocatable. This is generally not recommended, as the underlying system should have some overhead reserved; it could potentially lead to a system crash, since all of the node's resources could be used up. To fix this, you can look into the system-reserved and kube-reserved flags in your config.
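For example, a minimal sketch in /etc/rancher/rke2/config.yaml (the values are placeholders, size them for your nodes):

# /etc/rancher/rke2/config.yaml
kubelet-arg:
- "kube-reserved=cpu=500m,memory=1Gi"    # held back for kubelet/runtime overhead (placeholder values)
- "system-reserved=cpu=500m,memory=1Gi"  # held back for OS daemons (placeholder values)

The kubelet subtracts these from the node's capacity when it reports allocatable.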
g
@bland-appointment-12982 you are great, thank you for that. How about this configuration?
# /etc/rancher/rke2/config.yaml
kubelet-arg:
- "cpu-manager-policy=static"
- "cpu-manager-reconcile-period=5s"
- "topology-manager-policy=best-effort"
- "topology-manager-scope=container"
- "kube-reserved=cpu=500m,memory=1Gi,ephemeral-storage=1Gi"
- "system-reserved=cpu=500m,memory=1Gi,ephemeral-storage=1Gi"
- "reserved-system-cpus=0,1"
- "eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<15Gi"
- "eviction-soft=memory.available<1Gi,nodefs.available<5Gi,imagefs.available<20Gi"
- "eviction-soft-grace-period=memory.available=1m,nodefs.available=1m,imagefs.available=1m"
b
Looks about right. I don't know your specs or anything, but I would recommend reserving about 10% of CPU, memory and storage for your system, and leaving the rest up for grabs.
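If you want to sanity-check what actually got reserved, compare capacity and allocatable on the node; allocatable should roughly be capacity minus kube-reserved, system-reserved and the hard eviction threshold (the numbers below are made up):

kubectl describe node <node-name> | grep -A 6 -E '^Capacity:|^Allocatable:'
# Capacity:     cpu: 8   memory: 16Gi
# Allocatable:  cpu: 7   memory: ~14Gi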
b
Does altering eviction-hard work? We are looking into tweaking these settings at the moment and saw it gets dropped in the pebinary executor: https://github.com/rancher/rke2/blob/61cebdaf4b649351142221f85302f9292e1aa275/pkg/pebinaryexecutor/pebinary.go#L162
b
Eviction-hard essentially exists to keep the node healthy and not overcommitted in terms of resources.
b
Sorry, I may have hijacked the thread. We are looking at eviction-hard because a node just went down due to OOM, despite all pods having memory limits set and the cluster's total memory limit commitment being <100%.
b
I'm not sure what caused it, as I can't see it. Do you guys have some sort of monitoring on it? Off the top of my head, it sounds like pod eviction policies might be misconfigured. Or have you looked into node affinity and anti-affinity rules? Those can also cause resource imbalances.
Could also be that Deployments, DaemonSets or StatefulSets are misconfigured, without some sort of backoff mechanism, no PodDisruptionBudget etc.
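As an example of the last one, a PodDisruptionBudget is only a few lines; the name and selector below are made up, match them to your workload:

# pdb.yaml (illustration only)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app

Keep in mind a PDB only protects against voluntary evictions (drains, the eviction API), not against the kubelet's node-pressure evictions or the kernel OOM killer.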
g
@brainy-kilobyte-33711 Can you confirm your configuration? curl -X GET http://127.0.0.1:8001/api/v1/nodes/<node-name>/proxy/configz | jq .
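For that to work you need kubectl proxy running on 8001, and you can trim the output to the relevant bits; something like this (fields as returned under .kubeletconfig by the configz endpoint):

kubectl proxy --port=8001 &
curl -sS http://127.0.0.1:8001/api/v1/nodes/<node-name>/proxy/configz \
  | jq '.kubeletconfig | {evictionHard, evictionSoft, systemReserved, kubeReserved}'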