# general
c
I have installed the rancher-logging app (106.0.2+up4.10.0-rancher.6, via the Rancher GUI) on RKE2 with 4 worker nodes. Problem: fluent-bit seems to die with an OOM about 450 times per day. The fluentbit version included in rancher-logging is 3.1.8. Is it possible it cannot handle cgroupv2 yet?
[Thu Jun 26 09:59:41 2025] flb-pipeline invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=999
[Thu Jun 26 09:59:41 2025] CPU: 2 PID: 2291474 Comm: flb-pipeline Not tainted 6.1.0-29-amd64 #1  Debian 6.1.123-1
[Thu Jun 26 09:59:41 2025] Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.22.2 09/12/2024
[Thu Jun 26 09:59:41 2025] Call Trace:
[Thu Jun 26 09:59:41 2025]  <TASK>
[Thu Jun 26 09:59:41 2025]  dump_stack_lvl+0x44/0x5c
[Thu Jun 26 09:59:41 2025]  dump_header+0x4a/0x211
[Thu Jun 26 09:59:41 2025]  oom_kill_process.cold+0xb/0x10
[Thu Jun 26 09:59:41 2025]  out_of_memory+0x1fd/0x4c0
[Thu Jun 26 09:59:41 2025]  mem_cgroup_out_of_memory+0x134/0x150
[Thu Jun 26 09:59:41 2025]  try_charge_memcg+0x696/0x780
[Thu Jun 26 09:59:41 2025]  charge_memcg+0x39/0xf0
[Thu Jun 26 09:59:41 2025]  __mem_cgroup_charge+0x28/0x80
[Thu Jun 26 09:59:41 2025]  __handle_mm_fault+0x95c/0xfa0
[Thu Jun 26 09:59:41 2025]  handle_mm_fault+0xdb/0x2d0
[Thu Jun 26 09:59:41 2025]  do_user_addr_fault+0x191/0x550
[Thu Jun 26 09:59:41 2025]  exc_page_fault+0x70/0x170
[Thu Jun 26 09:59:41 2025]  asm_exc_page_fault+0x22/0x30
[Thu Jun 26 09:59:41 2025] RIP: 0033:0x7f3c1976ef4c
[Thu Jun 26 09:59:41 2025] Code: 00 00 00 74 a0 83 f9 c0 0f 87 56 fe ff ff 62 e1 fe 28 6f 4e 01 48 29 fe 48 83 c7 3f 49 8d 0c 10 48 83 e7 c0 48 01 fe 48 29 f9 <f3> a4 62 c1 fe 28 7f 00 62 c1 fe 28 7f 48 01 c3 0f 1f 40 00 4c 8b
[Thu Jun 26 09:59:41 2025] RSP: 002b:00007f3c181fa3c8 EFLAGS: 00010206
[Thu Jun 26 09:59:41 2025] RAX: 00007f3c0dfb01aa RBX: 00000000001e0000 RCX: 000000000000e84b
[Thu Jun 26 09:59:41 2025] RDX: 00000000000356a1 RSI: 00007f3c1177a196 RDI: 00007f3c0dfd7000
[Thu Jun 26 09:59:41 2025] RBP: 00000000000356a1 R08: 00007f3c0dfb01aa R09: 0000000000400000
[Thu Jun 26 09:59:41 2025] R10: 00000000001c0000 R11: 0000000000000048 R12: 00007f3c16169f40
[Thu Jun 26 09:59:41 2025] R13: 00007f3c11753340 R14: 00007f3c1604db80 R15: 00007f3c120da740
[Thu Jun 26 09:59:41 2025]  </TASK>
[Thu Jun 26 09:59:41 2025] memory: usage 97656kB, limit 97656kB, failcnt 4194
[Thu Jun 26 09:59:41 2025] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[Thu Jun 26 09:59:41 2025] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8b0a4ee5_d9cb_4ae3_876c_df15c8b305b5.slice:
[Thu Jun 26 09:59:41 2025] anon 97992704
                           file 4096
                           kernel 2002944
                           kernel_stack 147456
                           pagetables 356352
                           sec_pagetables 0
                           percpu 730296
                           sock 0
                           vmalloc 12288
                           shmem 0
                           zswap 0
                           zswapped 0
                           file_mapped 0
                           file_dirty 0
                           file_writeback 0
                           swapcached 0
                           anon_thp 46137344
                           file_thp 0
                           shmem_thp 0
                           inactive_anon 97984512
                           active_anon 8192
                           inactive_file 0
                           active_file 4096
                           unevictable 0
                           slab_reclaimable 250784
                           slab_unreclaimable 398248
                           slab 649032
                           workingset_refault_anon 0
                           workingset_refault_file 381
                           workingset_activate_anon 0
                           workingset_activate_file 1
                           workingset_restore_anon 0
                           workingset_restore_file 0
                           workingset_nodereclaim 26
                           pgscan 5984
                           pgsteal 3693
                           pgscan_kswapd 0
                           pgscan_direct 5984
                           pgsteal_kswapd 0
                           pgsteal_direct 3693
                           pgfault 1380514
                           pgmajfault 19
                           pgrefill 2120
                           pgactivate 2301
                           pgdeactivate 2120
                           pglazyfree 0
                           pglazyfreed 0
                           zswpin 0
                           zswpout 0
                           thp_fault_alloc 153
                           thp_collapse_alloc 0
[Thu Jun 26 09:59:41 2025] Tasks state (memory values in pages):
[Thu Jun 26 09:59:41 2025] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Thu Jun 26 09:59:41 2025] [2231365] 65535 2231365      243        1    28672        0          -998 pause
[Thu Jun 26 09:59:41 2025] [2291372]     0 2291372    52297    27372   348160        0           999 fluent-bit
[Thu Jun 26 09:59:41 2025] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-d4c669550fc25bc650c28f72b7bad4d279f0f68a94c79d7c9ccb729f2b83e20d.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8b0a4ee5_d9cb_4ae3_876c_df15c8b305b5.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8b0a4ee5_d9cb_4ae3_876c_df15c8b305b5.slice/cri-containerd-d4c669550fc25bc650c28f72b7bad4d279f0f68a94c79d7c9ccb729f2b83e20d.scope,task=fluent-bit,pid=2291372,uid=0
[Thu Jun 26 09:59:41 2025] Memory cgroup out of memory: Killed process 2291372 (fluent-bit) total-vm:209188kB, anon-rss:95464kB, file-rss:14024kB, shmem-rss:0kB, UID:0 pgtables:340kB oom_score_adj:999
[Thu Jun 26 09:59:41 2025] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8b0a4ee5_d9cb_4ae3_876c_df15c8b305b5.slice/cri-containerd-d4c669550fc25bc650c28f72b7bad4d279f0f68a94c79d7c9ccb729f2b83e20d.scope are going to be killed due to memory.oom.group set
[Thu Jun 26 09:59:41 2025] Memory cgroup out of memory: Killed process 2291474 (flb-pipeline) total-vm:209188kB, anon-rss:95508kB, file-rss:14024kB, shmem-rss:0kB, UID:0 pgtables:340kB oom_score_adj:999
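A quick way to cross-check this from the Kubernetes side is the restart count plus the container's last-state reason, which should read OOMKilled for a memcg kill like the one above (pod names will differ per cluster):
kubectl -n cattle-logging-system get pods | grep fluentbit
kubectl -n cattle-logging-system describe pod <a-fluentbit-pod> | grep -A5 'Last State'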
c
It's more likely you just need to increase the resources. Logging and monitoring are resource intensive.
c
AFAICT the Helm chart brings its own CPU and memory requests and limits. The defaults set for the cattle-logging-system namespace in Rancher are ignored.
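A quick way to see what the chart actually applied to the fluentbit pods (the daemonset name varies by chart version, so list them first):
kubectl -n cattle-logging-system get daemonsets
kubectl -n cattle-logging-system get daemonset <fluentbit-daemonset> \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'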
c
... just because you set limits on the ns doesn't mean the pods don't also have their own limits
You need to set chart values to override the default requests and limits so the pods can handle whatever load your environment puts on them. This is true for anything you deploy.
The ns limits apply to everything in the ns combined.
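Something along these lines in the chart values should do it. Note that the fluentbit key path here is an assumption on my part; check values.yaml for your chart version, since newer versions may expect this on the Logging/FluentbitAgent resource instead:
fluentbit:
  resources:
    requests:
      memory: 200Mi
    limits:
      memory: 300Mi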
f
We had to do something similar as well. I'm not sure what underlying stack rancher-logging uses, but we wound up using the kube-logging / Banzai Cloud logging operator
These are the manifests we use; you'll need to adjust as you see fit
# NB: The Kube-Logging site has better CRD docs than the Cisco/BanzaiCloud
#     site. https://kube-logging.github.io/
# NB: Some flags translate down to fluentd flags, so check their docs
#     for more info. https://docs.fluentd.org/configuration/
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: {{ LOGGING_NS }}
spec:
  controlNamespace: {{ LOGGING_NS }}
  fluentd:
    disablePvc: true
    resources:
      limits:
        memory: 800M
      requests:
        memory: 400M
    scaling:
      drain:
        enabled: true
  fluentbit:
    # Tweak fluentbit to run on controlplane nodes as well
    tolerations:
    - effect: NoExecute
      key: CriticalAddonsOnly
      operator: Exists
    # Tweak fluentbit memory limits, defaults are 50/100M, which cause a lot of OOM kills
    resources:
      requests:
        memory: 200M
      limits:
        memory: 200M
---
# Import Kubernetes events into logs
apiVersion: logging-extensions.banzaicloud.io/v1alpha1
kind: EventTailer
metadata:
  name: event-tailer
spec:
  controlNamespace: {{ LOGGING_NS }}
---
... ClusterFlow and ClusterOutput manifests
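For completeness, a minimal ClusterFlow/ClusterOutput pair looks roughly like this; the Loki output and its URL below are just placeholders for whatever backend you actually ship logs to:
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: default-output
  # ClusterOutputs/ClusterFlows must live in the controlNamespace
  namespace: {{ LOGGING_NS }}
spec:
  loki:
    # placeholder backend; use your real output here
    url: http://loki.monitoring.svc:3100
    configure_kubernetes_labels: true
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: all-logs
  namespace: {{ LOGGING_NS }}
spec:
  match:
    - select: {}          # match all logs
  globalOutputRefs:
    - default-output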
iirc, our starting point was to bump the limit up until fluentbit was only consuming about 50% of its request & limit. I think we had to bump it up again later when the cluster came under heavier load and the operators needed to chew through more data
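If metrics-server is running, the quickest way to watch where usage sits relative to that limit is:
kubectl -n <logging-namespace> top pods --containers | grep fluentbit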
These are the Helm chart values we use for the operator itself
rbac:
  psp:
    enabled: false
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
The rbac.psp.enabled field is probably not necessary anymore (PodSecurityPolicy was removed in Kubernetes 1.25)
Now I'm remembering the counterintuitive part... The request/limits aren't part of the Helm values anymore; they're part of the Logging resource you create AFTER Helm is installed to set up fluentd/fluentbit