10/28/2022, 4:36 PM
I'm not sure this is the right place to ask since it's about Prometheus, but this is the Helm chart installed via the Rancher UI. The Prometheus pod keeps dying, apparently from OOM. I haven't found any further explanation so far. It had been running OK for about a week, so I wonder if it's just a sizing issue.
CrashLoopBackOff (back-off 5m0s restarting failed container=prometheus pod=prometheus-rancher-monitoring-prometheus-0_cattle-monitoring-system(5296c2b1-660b-4c15-a16f-b139a66b559d)) | Last state: Terminated with 137: OOMKilled (ponent=tsdb msg="WAL segment loaded" segment=207 maxSegment=208
level=info ts=2022-10-28T16:29:49.827Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=208 maxSegment=208
level=info ts=2022-10-28T16:29:49.828Z caller=head.go:860 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=12.698413338s
I'm using the default values in the Helm chart and only changed the retention to 10d. Memory request is 750M, limit 3GB.
Is there a way to ballpark the container's memory usage based on the data retention, the number of dashboards, etc.?
I bumped the memory to 10G, and I see it is indeed using 3.7G at times.


10/28/2022, 4:53 PM
Prometheus’s memory utilization isn’t really influenced by the number of dashboards. It’s more about data cardinality: how many distinct time series it is collecting in a period of time. More nodes, pods, services, etc. will push that up. Basically, the larger your cluster, the more memory it will need.
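
For a rough ballpark, here is a minimal Python sketch that scales observed memory by series count. It assumes memory grows roughly linearly with the number of active series, which is only an approximation. The inputs would typically come from Prometheus's own metrics `prometheus_tsdb_head_series` and `process_resident_memory_bytes`. The function names, the 500,000 series figure, and the 1.5x headroom factor are hypothetical, chosen for illustration; only the 3.7G figure comes from this thread.

```python
def bytes_per_series(resident_bytes: float, head_series: int) -> float:
    """Observed memory cost per active series (resident memory / head series)."""
    if head_series <= 0:
        raise ValueError("head_series must be positive")
    return resident_bytes / head_series


def projected_memory_bytes(
    resident_bytes: float,
    head_series: int,
    target_series: int,
    headroom: float = 1.5,  # assumed safety margin for spikes such as WAL replay
) -> float:
    """Linearly scale the observed per-series cost to a target series count."""
    per_series = bytes_per_series(resident_bytes, head_series)
    return per_series * target_series * headroom


if __name__ == "__main__":
    observed = 3.7e9   # about 3.7G, the peak usage reported in this thread
    current = 500_000  # hypothetical value of prometheus_tsdb_head_series
    target = 750_000   # hypothetical series count after the cluster grows
    estimate = projected_memory_bytes(observed, current, target)
    print(f"suggested memory limit: about {estimate / 1e9:.1f} GB")
```

Comparing the estimate against the container's memory limit gives a quick sanity check before the cluster grows; the actual per-series cost varies with label sizes, scrape interval, and query load.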


10/28/2022, 5:00 PM
Thanks, that makes a lot of sense. I was a bit lazy typing; that's what I meant in my head.