# k3s
c
it's almost impossible to troubleshoot issues from just a few cherry-picked log lines. If you can upload the entire k3s service log somewhere there might be something interesting in it, but the info you provided isn't enough to work off of.
o
Hi there, I understand, I just thought it might be something that rings a bell quickly. Attached is the log from ±2h before and after the event.
c
just looks to me like you ran out of memory
bunch of ollama containers getting OOM killed
o
hmm.. I saw that, but since the server still had quite a lot of resources left I couldn't imagine it would get to this point. I did notice there was a pod that kept restarting; it's possible the deployer did not allocate enough resources, so it kept restarting and eventually that triggered the OOM?
c
the various “context cancelled” errors appear to be just from clients disconnecting when pods were killed or deleted
or perhaps there were clients connected that dropped when your internet went away
but either way I don’t see anything wrong with k3s here
o
I see. Is it possible that an under-resourced pod could cause a global OOM?
c
yeah if the process exceeds the memory limit it will get oom killed
o
by global OOM I mean it could eat memory beyond what was allocated to the pod?
I might be asking nonsense, just my logic is:
• if one of the pods hits OOM it will get killed by the kernel, and the limit should be the pod's allocated memory
• that should not affect any other pods
Do you think the possible issue is that swap is not disabled on this server, so it seemed there was more memory than there actually was?
c
I’ve not generally seen that be a problem
o
I've seen it mentioned somewhere, that's why I'm asking. Do you have any suggestions on how to prevent this from happening in the future?
c
I don’t see an unexpected crash or deletion of pods, just see that your workload tried to use more memory than requested and it was oom killed. You can address that by allocating additional resources, or tuning the workload configuration to reduce the memory it’s using. There isn’t really anything to be done in K3s itself.
o
Makes sense. What exactly are you referring to as "workload configuration"? I can't change how many resources a client allocates for themselves (apart from filtering them out in the bidding process).
c
the resource requests/limits in the pod spec
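e.g. something along these lines (just a sketch, the names, image and numbers here are placeholders, not taken from your deployment):

```yaml
# sketch only: pod/container names, image and numbers are placeholders
apiVersion: v1
kind: Pod
metadata:
  name: example-workload
spec:
  containers:
    - name: app
      image: example/app:latest
      resources:
        requests:
          memory: "512Mi"   # what the scheduler reserves for this container
          cpu: "250m"
        limits:
          memory: "1Gi"     # exceeding this gets the container OOM killed
          cpu: "1"
```

(when requests equal limits for every container, the pod ends up with the Guaranteed QoS class)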
o
yeah, that's something I cannot influence except in the bidding process, where I can reject deployments that are below a certain threshold. Is there no mechanism within k3s that would kill a pod if it has surpassed its allocated resources?
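(side note: on the Kubernetes side, a per-namespace LimitRange is one way to enforce that kind of minimum/maximum on what a deployment may request; a rough sketch, the namespace name and values are made up:)

```yaml
# sketch only: namespace name and values are made up
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-memory-bounds
  namespace: tenant-namespace
spec:
  limits:
    - type: Container
      min:
        memory: "256Mi"     # containers requesting less than this are rejected
      max:
        memory: "8Gi"       # containers asking for more than this are rejected
      default:
        memory: "1Gi"       # limit applied when the container sets none
      defaultRequest:
        memory: "512Mi"     # request applied when the container sets none
```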
c
that is part of Kubernetes…
what do you mean by “except in bidding process”. Are you not directly controlling the resources you’re allocating to your pods?
o
Akash Network works like this: when customers want to deploy a pod on your machine they specify in their YAML how many resources they want, and then the provider (a pod on my node running an Akash provider Docker image) prices that and bids, but only if the machine has enough resources.
That's why I don't fully understand why this happened: k3s killed the pod, but it was then restarted many times (I can't fully confirm it was restarted because of OOM, I didn't check that morning when I noticed it), and it seems that restarting it 50 times caused resource usage to skyrocket.
Hey again. I think I now understand more or less what happened. The leased pods have QoS level Guaranteed (all memory limits are set), so everything else gets evicted before them. The offending pod triggered the kernel's OOM killer instead of k3s evicting it, as the default behaviour is Delete. My potential strategy:
1. set alerts for OOM
2. set an eviction policy that evicts a pod if it is restarted too many times
The default is delete; is it possible to set it per namespace or per QoS level? Do you have any documentation at hand on how to handle the 2nd solution?
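For reference, a rough sketch of how kubelet eviction thresholds could be passed through in k3s (the path is the standard k3s config location, the threshold values are only illustrative; this covers node-level memory pressure, a container exceeding its own memory limit is still OOM-killed by the kernel):

```yaml
# /etc/rancher/k3s/config.yaml -- sketch only, threshold values are illustrative
kubelet-arg:
  - "eviction-hard=memory.available<500Mi"
  - "eviction-soft=memory.available<1Gi"
  - "eviction-soft-grace-period=memory.available=1m30s"
```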
Regardless of whether you answer or not, thank you for your help so far, it's been very helpful.