# rke2
s
If you're looking to set up alerting based on OOM events, you can use a combination of Kubernetes monitoring tools and log aggregation systems. For example, you can use Prometheus to monitor resource usage metrics, including memory usage, and set up alerts based on thresholds. You can also use tools like the ELK Stack (Elasticsearch, Logstash, Kibana) or Fluentd to aggregate and analyze logs, including pod logs, for OOM events and create alerts based on specific log patterns or exit codes. Additionally, Kubernetes itself provides event logging that can be monitored for OOM-related events. You can use `kubectl get events` or tools like `kubectl describe pod` to check for events related to pod terminations.
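For example, something along these lines would surface the recorded events and the last termination state (the pod and namespace names are placeholders):

```sh
# List events recorded for a specific pod (terminations, kills, scheduling, etc.)
kubectl get events -n my-namespace \
  --field-selector involvedObject.name=my-pod \
  --sort-by=.lastTimestamp

# Show the last terminated state of each container in the pod;
# an OOM kill shows up here as reason "OOMKilled" with exit code 137
kubectl get pod my-pod -n my-namespace \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated}{"\n"}{end}'
```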
h
Thanks for the reply; I have both monitoring and logging (to Loki) set up... the pod in question was restarted, and when I ran `kubectl describe pod` it shows an exit code of 137. However, I do not see in Grafana that the pod consumed anywhere close to the defined memory limit, and when I look at `kubectl logs pod_name`, I do not see any entry for OOM. So I am trying to figure out why this happened and where I can see it, so alerting can be set up.
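One thing worth noting, as a hedged sketch: exit code 137 just means the container received SIGKILL (128 + 9), which can come from the cgroup/kernel OOM killer or from anything else sending SIGKILL, and the kernel's OOM messages land in the node's system log rather than the pod's stdout, which is why `kubectl logs` shows nothing. Since logs already go to Loki, those node logs can be searched there, assuming Promtail scrapes the journal; the address and label selector below are placeholders for whatever the real setup uses:

```sh
# Search node logs shipped to Loki for OOM killer activity around the restart.
# The address and the {job="systemd-journal"} selector are assumptions about
# how Promtail is configured; adjust to match the actual labels.
logcli --addr=http://loki.monitoring:3100 query --since=24h \
  '{job="systemd-journal"} |~ "(?i)(oom-killer|out of memory)"'
```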
s
If the pod consumed all of its memory, or tried to, in a matter of a few seconds, you won't necessarily see it in Prometheus, because Prometheus doesn't scrape the metrics every second.
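A way around that sampling gap, assuming kube-state-metrics is running (metric names can differ between versions), is to alert on the recorded termination reason instead of trying to catch the memory spike itself. A minimal sketch of such a rule:

```sh
# Sketch of a Prometheus alerting rule that fires when a container's last
# termination reason is OOMKilled, regardless of whether the memory spike
# was ever scraped. In practice this is often combined with
# increase(kube_pod_container_status_restarts_total[5m]) > 0 so it only
# fires on fresh restarts.
cat <<'EOF' > oom-alert-rules.yml
groups:
  - name: oom
    rules:
      - alert: ContainerOOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        labels:
          severity: warning
        annotations:
          summary: 'Container {{ $labels.container }} in pod {{ $labels.pod }} was OOMKilled'
EOF

# Validate the rule file before loading it into Prometheus
promtool check rules oom-alert-rules.yml
```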
h
Thank you! I suspected that was the case... I will have to check how frequently Prometheus scrapes the metrics.
Looks like the default is 60 seconds, per this doc
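If it's a kube-prometheus-stack / prometheus-operator setup (an assumption; adjust for however Prometheus is actually deployed), the configured interval is visible on the Prometheus custom resource, and ServiceMonitors can override it per endpoint:

```sh
# Global scrape interval set on the Prometheus custom resource (empty = operator default)
kubectl get prometheus -A \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.scrapeInterval}{"\n"}{end}'

# Per-endpoint overrides defined in ServiceMonitors
kubectl get servicemonitors -A -o yaml | grep -n "interval:"
```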
c
Also worth checking your system logs to confirm it's not the kernel doing the OOM killing (this was the case for us recently when we had transparent huge pages enabled).
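For reference, checking on the node itself might look like this (assumes a systemd-based node with journald and sudo access):

```sh
# Kernel messages about the OOM killer since the last boot
sudo journalctl -k | grep -iE "out of memory|oom-killer|killed process"

# Same information from the kernel ring buffer, with human-readable timestamps
sudo dmesg -T | grep -i "oom"

# Check whether transparent huge pages are enabled on this node
cat /sys/kernel/mm/transparent_hugepage/enabled
```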