# general
c
It's more likely you just need to increase the resources. Logging and monitoring are resource intensive.
c
AFAICT the helm chart brings its own cpu and memory requests and limits. The defaults set for the cattle-logging-system namespace in Rancher are ignored.
c
... just because you set limits on the ns doesn't mean that pods don't also have limits
You need to set chart values to modify the default requests and limits so they can handle whatever load your environment puts on them. This is true for anything you deploy.
The ns limits are for everything in the whole ns.
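For illustration (names and numbers below are made up, not from the chart): a namespace LimitRange only supplies defaults to containers that don't declare their own requests/limits, so whatever the chart bakes into its pod specs wins.
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-defaults                 # made-up name
  namespace: cattle-logging-system
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 128Mi                  # used only when a container sets no request
    default:
      memory: 256Mi                  # used only when a container sets no limit
The rancher-logging pods already carry their own resources blocks, so defaults like these never get merged in.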
f
We had to do something similar as well. I'm not sure what underlying stack rancher-logging uses, but we wound up using the kube-logging / Banzai logging operator.
These are the manifests we use; you'll need to adjust them as you see fit:
# NB: The Kube-Logging site has better CRD docs than the Cisco/BanzaiCloud
#     site. https://kube-logging.github.io/
# NB: Some flags translate down to fluentd flags, so check their docs
#     for more info. https://docs.fluentd.org/configuration/
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: {{ LOGGING_NS }}
spec:
  controlNamespace: {{ LOGGING_NS }}
  fluentd:
    disablePvc: true
    resources:
      limits:
        memory: 800M
      requests:
        memory: 400M
    scaling:
      drain:
        enabled: true
  fluentbit:
    # Tweak fluentbit to run on controlplane nodes as well
    tolerations:
    - effect: NoExecute
      key: CriticalAddonsOnly
      operator: Exists
    # Tweak fluentbit memory limits, defaults are 50/100M, which cause a lot of OOM kills
    resources:
      requests:
        memory: 200M
      limits:
        memory: 200M
---
# Import Kubernetes events into logs
apiVersion: logging-extensions.banzaicloud.io/v1alpha1
kind: EventTailer
metadata:
  name: event-tailer
spec:
  controlNamespace: {{ LOGGING_NS }}
---
... ClusterFlow and ClusterOutput manifests
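A minimal ClusterOutput/ClusterFlow pair looks something like this (generic sketch with placeholder names and a placeholder Loki endpoint, not the actual manifests from this setup):
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: loki                       # placeholder name
  namespace: {{ LOGGING_NS }}      # cluster-scoped outputs must live in the controlNamespace
spec:
  loki:
    url: http://loki:3100          # placeholder endpoint
    configure_kubernetes_labels: true
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: all-logs
  namespace: {{ LOGGING_NS }}
spec:
  match:
    - select: {}                   # match everything
  globalOutputRefs:
    - loki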
iirc, our starting point was to bump the values up until actual usage sat at about 50% of the request & limit. I think we may have needed to bump them again when our cluster came under some heavier load and the operators needed to chew through more data.
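If it helps, you can eyeball where actual usage sits relative to the requests with metrics-server:
# assumes metrics-server is installed; use your logging namespace
kubectl top pods -n <logging-namespace>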
These are the Helm chart values we use for the operator itself
rbac:
  psp:
    enabled: false
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
The rbac.psp.enabled field is probably not necessary anymore.
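For completeness, installing the operator with those values looks roughly like this (the OCI chart location is the one in the kube-logging docs and may differ for older releases; operator-values.yaml is just a placeholder file name):
helm upgrade --install logging-operator oci://ghcr.io/kube-logging/helm-charts/logging-operator \
  --namespace logging --create-namespace \
  -f operator-values.yaml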
Now I'm remembering the counterintuitive part... The requests/limits aren't part of the Helm values anymore; they're part of the Logging resource you create AFTER Helm is installed to set up fluentd/fluentbit.
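So if you need to adjust them later, you edit (or re-apply) the Logging CR rather than rerunning Helm:
kubectl edit logging <name>   # or re-apply your Logging manifest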