# k3s
c
Hi everyone! I've been struggling to diagnose high CPU utilization on my control-plane nodes. I'm running an HA k3s cluster with embedded etcd, kube-vip for load balancing, and managing that cluster with Rancher. I periodically see high CPU utilization on just one control-plane node with no apparent pod causing the usage. It's always on the leader server, and it can be temporarily mitigated by restarting the k3s service on the host OS. My theory is that it is somehow being caused by excessive requests to the Kubernetes API. After turning on audit logging, we are seeing many requests per second that look like the following:
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "249d3969-b4af-4444-99b2-77e284b0521f",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/nodes?labelSelector=p2p.k3s.cattle.io%2Fenabled%3Dtrue",
  "verb": "list",
  "user": {
    "username": "system:k3s-supervisor",
    "groups": [
      "system:masters",
      "system:authenticated"
    ]
  },
  "sourceIPs": [
    "127.0.0.1"
  ],
  "userAgent": "k3s-supervisor@ctl01-k8s/v1.31.7+k3s1 (linux/amd64) k3s/e050ca66",
  "objectRef": {
    "resource": "nodes",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2025-04-25T15:07:41.324466Z",
  "stageTimestamp": "2025-04-25T15:07:41.326605Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": ""
  }
}
Has anyone seen this before? I'd normally write it off and rebuild, but I've already rebuilt our host machines (RHEL 9) and reinstalled k3s, and the issue persists even after that.
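For anyone wanting to reproduce the audit setup mentioned above: k3s forwards apiserver flags from its config file, and a `Metadata`-level policy matches the `"level"` in the event shown. The file paths and retention are assumptions for illustration, not taken from the thread:
Copy code
# /etc/rancher/k3s/config.yaml (paths are illustrative)
kube-apiserver-arg:
  - 'audit-log-path=/var/lib/rancher/k3s/server/logs/audit.log'
  - 'audit-policy-file=/etc/rancher/k3s/audit-policy.yaml'

# /etc/rancher/k3s/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata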
d
try installing some monitoring tool, or a simple htop inside, to see what is running
try setting limits and requests on your services if they are not configured
that's to diagnose what is causing it
maybe there is an OOM kill somewhere
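On the requests/limits suggestion, this is the per-container `resources` block in a pod spec; the values below are illustrative, not from the thread:
Copy code
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi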
c
Thanks for the suggestions! I am running kube-prometheus-stack and seeing a high RPC rate on etcd (~300 ops/s) and high work-queue latency (10s) for some parts of the Kubernetes API. Are you suggesting running htop on the host OS? If so, I'm only seeing the high CPU attributed to the k3s service, with no breakdown beyond that. I also have the below API limits in my `config.yaml`:
kube-apiserver-arg:
 - 'max-requests-inflight=100'
 - 'max-mutating-requests-inflight=50'
kube-controller-manager-arg:
 - 'kube-api-qps=20'
 - 'kube-api-burst=30'
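On the "no breakdown beyond the k3s service" point: htop shows the aggregate process, but a per-thread view can hint at which subsystem is hot. A sketch using standard procps tools; `k3s` as the process name is the only assumption, with a fallback so the command still runs where k3s isn't present:

```shell
# Per-thread CPU breakdown of the k3s server process.
# Thread names (comm) often hint at the busy subsystem.
pid=$(pgrep -xo k3s || echo $$)   # fall back to this shell if k3s isn't running
ps -T -p "$pid" -o spid,comm,%cpu --sort=-%cpu | head -n 15
```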
d
depending on what you find, try to add some swap if possible
c
you didn’t say what version you’re on, but this sounds a lot like https://github.com/k3s-io/k3s/issues/12127
c
Ha! I just came back to this chat after digging that up as well! I just started using Harbor as a pull-through image registry anyway, so I am going to try disabling Spegel registry mirroring to see if that clears things up. Thanks!
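For anyone else hitting this: the Spegel mirror is k3s's "embedded registry" feature, so disabling it is a server config change plus a k3s service restart. A sketch assuming the default setup; any mirror entries in `registries.yaml` would also need removing:
Copy code
# /etc/rancher/k3s/config.yaml -- drop the flag or set it false, then restart k3s
embedded-registry: false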
d
I contribute to Harbor @clean-piano-40966, in case anything is needed there
c
@damp-xylophone-94549 Perfect, I'll keep that in mind, thank you! @creamy-pencil-82913 I'm on `v1.31.7+k3s1`, and it's (currently) looking like turning off that image registry mirroring has reduced my CPU. Love to see graphs like this:
c
it’s fixed for this month’s releases, if you do eventually want to turn it back on again
c
Awesome, thank you for the insight.