# k3s
c
it's almost impossible to troubleshoot issues from just a few cherry-picked log lines. If you can upload the entire k3s service log somewhere there might be something interesting in it, but the info you provided isn't enough to work off of.
o
Hi there, I understand, I just thought it might be something that rings a bell quickly. Attached is the log from ±2h before and after the event.
c
just looks to me like you ran out of memory
bunch of ollama containers getting OOM killed
o
hmm.. I saw that, but since the server still had quite a lot of resources left I couldn't imagine it would get to this point. I did notice there was a pod that kept restarting; it's possible the deployer did not allocate enough resources, so it kept restarting and eventually that triggered the OOM?
c
the various “context cancelled” errors appear to be just from clients disconnecting when pods were killed or deleted
or perhaps there were clients connected that dropped when your internet went away
but either way I don’t see anything wrong with k3s here
o
I see. Is it possible that an under-resourced pod could cause a global OOM?
c
yeah if the process exceeds the memory limit it will get oom killed
o
by global OOM I mean it could eat memory beyond what was allocated to the pod?
I might be asking nonsense, just my logic is:
• if one of the pods hits OOM it will get killed by the kernel, and the limit should be the pod's allocated memory
• that should not affect any other pods
Do you think the possible issue is that swap is not disabled on this server, so it seemed there was more memory than there actually was?
c
I’ve not generally seen that be a problem
o
I've seen it mentioned somewhere, that's why I'm asking. Do you have any suggestions on how to prevent this from happening in the future?
c
I don’t see an unexpected crash or deletion of pods, just see that your workload tried to use more memory than requested and it was oom killed. You can address that by allocating additional resources, or tuning the workload configuration to reduce the memory it’s using. There isn’t really anything to be done in K3s itself.
o
Makes sense. What exactly are you referring to as "workload configuration"? I can't change how many resources a client allocates for themselves (apart from filtering them out in the bidding process).
c
the resource requests/limits in the pod spec
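e.g. something along these lines (just a sketch, the names, image and numbers here are placeholders, not taken from your deployment):

```yaml
# sketch only: pod/container names, image and numbers are placeholders
apiVersion: v1
kind: Pod
metadata:
  name: example-workload
spec:
  containers:
    - name: app
      image: example/app:latest
      resources:
        requests:
          memory: "512Mi"   # what the scheduler reserves for this container
          cpu: "250m"
        limits:
          memory: "1Gi"     # exceeding this gets the container OOM killed
          cpu: "1"
```

(when requests equal limits for every container, the pod ends up with the Guaranteed QoS class)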
o
yeah, that's something I cannot influence except in the bidding process, where I can reject deployments that are below a certain threshold. Is there no mechanism within k3s that would kill a pod if it has surpassed its allocated resources?
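(side note: on the Kubernetes side, a per-namespace LimitRange is one way to enforce that kind of minimum/maximum on what a deployment may request; a rough sketch, the namespace name and values are made up:)

```yaml
# sketch only: namespace name and values are made up
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-memory-bounds
  namespace: tenant-namespace
spec:
  limits:
    - type: Container
      min:
        memory: "256Mi"     # containers requesting less than this are rejected
      max:
        memory: "8Gi"       # containers asking for more than this are rejected
      default:
        memory: "1Gi"       # limit applied when the container sets none
      defaultRequest:
        memory: "512Mi"     # request applied when the container sets none
```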
c
that is part of Kubernetes…
what do you mean by “except in bidding process”. Are you not directly controlling the resources you’re allocating to your pods?
o
Akash Network works like this: when customers want to deploy a pod on your machine they specify in their YAML how many resources they want, and then the provider (a pod on my node running an Akash provider Docker image) prices that and bids, but only if the machine has enough resources.
That's why I don't fully understand why this happened: k3s killed the pod, but it was then restarted many times (I can't fully confirm it was restarted because of OOM, I didn't check that morning when I noticed it), and it seems that restarting it 50 times caused resource usage to skyrocket.
Hey again. I think I now understand more or less what happened. The leased pods have QoS level Guaranteed (all memory limits are set), so everything else gets evicted before them. The offending pod triggered the kernel's OOM killer instead of k3s evicting it, as the default behaviour is Delete. My potential strategy:
1. set alerts for OOM
2. set an eviction policy that evicts a pod if it is restarted too many times
The default is delete; is it possible to set it per namespace or per QoS level? Do you have any documentation at hand on how to handle the 2nd solution?
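For reference, a rough sketch of how kubelet eviction thresholds could be passed through in k3s (the path is the standard k3s config location, the threshold values are only illustrative; this covers node-level memory pressure, a container exceeding its own memory limit is still OOM-killed by the kernel):

```yaml
# /etc/rancher/k3s/config.yaml -- sketch only, threshold values are illustrative
kubelet-arg:
  - "eviction-hard=memory.available<500Mi"
  - "eviction-soft=memory.available<1Gi"
  - "eviction-soft-grace-period=memory.available=1m30s"
```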
Regardless of whether you answer or not, thank you for your help so far, it's been very helpful.