# k3s
l
I’m trying to figure out why 2 out of 3 control planes are seeing approx. 50% CPU usage across 5+ k3s server processes. Looking into it, I see over 12,000 entries in the audit logs for list nodes from the k3s-supervisor, on a requestURI of: api/v1/nodes/?labelSelector=p2p.k3s.cattle.io%2Fenabled%3Dtrue That’s a lot of requests for that label … the label is on the nodes. We see this on v1.32.3+k3s1. Could this be the cause of the high CPU usage? It seems interesting to me that a list on this label is executed that often.
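For reference, that requestURI URL-decodes to the label selector p2p.k3s.cattle.io/enabled=true. A minimal way to see which nodes currently carry the label (plain kubectl, nothing k3s-specific assumed):

```
# List only the nodes carrying the label that the supervisor keeps querying for
kubectl get nodes -l p2p.k3s.cattle.io/enabled=true --show-labels
```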
On version 1.32.1+k3s1 and below … there’s no p2p.k3s.cattle.io label on the control-plane nodes.
I found this commit: https://github.com/k3s-io/k3s/commit/95700aa6b327f77e8a8a992377aeabc89bc20ee5 Are those changes relevant to what I’m seeing?
c
this will only be enabled if you are running with --embedded-registry=true to enable spegel. That particular query will only be made when libp2p (spegel) is trying to bootstrap the p2p mesh because it has no peers. It sounds like your environment is somehow broken or misconfigured?
for the record, the query you’re seeing is here: https://github.com/k3s-io/k3s/blob/master/pkg/spegel/bootstrap.go#L179-L181 and has been around since spegel support was added in 1.29.1 https://github.com/k3s-io/k3s/pull/8977
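A rough way to check whether spegel/libp2p is actually stuck bootstrapping is to grep the server journal for it (a sketch, assuming k3s runs as the systemd unit `k3s` on the server nodes):

```
# Look for spegel / libp2p bootstrap chatter or errors in the k3s server journal
journalctl -u k3s --since "1 hour ago" --no-pager | grep -iE 'spegel|libp2p|p2p'
```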
l
Hmm. I’ll look into whether or not it could be something we’re doing on our end. But downgrading the control plane to 1.32.1 and we’re good. So hmmm. In earlier versions (<1.32.1) the p2p label is not there on the control planes.
We downgraded to 1.32.1
And yes, we set embedded-registry: true both on <1.32.1 and above it
Of course there’s also the possibility that the newer spegel version is the culprit somehow
c
do you see any errors from spegel in the logs?
you can run with --debug or debug: true to get more output from it
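In config-file form this would look roughly like the following sketch of /etc/rancher/k3s/config.yaml (config keys map to the CLI flags with the leading dashes dropped):

```
# /etc/rancher/k3s/config.yaml
embedded-registry: true   # enables spegel and the p2p bootstrap query discussed above
debug: true               # same effect as passing --debug on the command line
```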
l
--debug on K3s itself I assume?
I’ll give it a try when I can upgrade again
c
were you able to get any more info on this?
l
Thank you very much for reaching out. Still researching. We bumped our internal test cluster to 1.32.3 just today, to debug while the issue is there. Right now I’m at KubeCon, so the time I can dedicate is limited.
I enabled --debug on all 3 k3s server nodes in the cluster. We see a huge bunch of bad TLS certificate errors. See: https://gist.github.com/larssb/23f7549427b3d31ae51cf5e7cea621c9
No parameters were changed across the upgrade. But from k3s v1.32.1 to 1.32.3 there’s an upgrade to containerd 2, runc is bumped … I looked at issues applicable to TLS bad cert, however nothing seemed applicable to our situation. Any idea @creamy-pencil-82913 ?
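One generic way to narrow down bad TLS certificate errors is to look at which certificate the endpoints in question are actually serving; a sketch (the IP and port are placeholders, substitute the ones the errors in the gist point at, e.g. 10250 for kubelet or 6443 for the apiserver/supervisor):

```
# Show subject, issuer and validity of the certificate presented by an endpoint
openssl s_client -connect <node-ip>:10250 </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```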
c
no, there’s not really much to work with here. What are all those IPs reporting the bad certificates? Are those nodes on your network, or pod IPs, or what? The etcd errors that you shared in a comment look like they’re from a normal startup of a server node?
I don’t see anything here at all from spegel or libp2p, which is the component that would be trying to find nodes with that label
l
The IPs are pods, yes. Also nodes in the cluster and the cloud’s network backplane.
Yeah, k3s is trying to start up, but it never succeeds.
c
that sounds like a different problem entirely
the little bit of etcd logs you shared show that it can’t connect to two of the peers. Are you trying to start up only one server of a 3-node cluster?
If you have 3 servers, you need at least 2 of them online. If this node can’t connect to at least one of the other servers it won’t ever start up. These logs say that there are at least 2 nodes that it can’t connect to:
```
Apr 01 16:59:41 test-test-ctlplane-0 k3s[36770]: {"level":"warn","ts":"2025-04-01T16:59:41.795283Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"5b55765432c297","rtt":"0s","error":"dial tcp 192.168.114.86:2380: connect: connection refused"}
Apr 01 16:59:41 test-test-ctlplane-0 k3s[36770]: {"level":"warn","ts":"2025-04-01T16:59:41.795310Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"5b55765432c297","rtt":"0s","error":"dial tcp 192.168.114.86:2380: connect: connection refused"}
Apr 01 16:59:41 test-test-ctlplane-0 k3s[36770]: {"level":"warn","ts":"2025-04-01T16:59:41.797500Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"7026efc5b923151d","rtt":"0s","error":"dial tcp 192.168.114.85:2380: connect: connection refused"}
Apr 01 16:59:41 test-test-ctlplane-0 k3s[36770]: {"level":"warn","ts":"2025-04-01T16:59:41.797513Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"7026efc5b923151d","rtt":"0s","error":"dial tcp 192.168.114.85:2380: connect: connection refused"}
```
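A quick sanity check from this node, using the peer IPs and the etcd peer port from the lines above (assumes nc/netcat is available on the node):

```
# etcd peer traffic uses port 2380; "connection refused" means nothing is listening there
nc -zv 192.168.114.85 2380
nc -zv 192.168.114.86 2380
```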
l
Yup. After enabling --debug --v=9 … no nodes are starting. Not saying that it’s caused by this, just how it happened to go. So I need to get at least one more node up.
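A minimal sketch of bringing another server back online, assuming the standard systemd install on the other two control-plane nodes:

```
# On each of the other server nodes: check the service, look at recent logs, then start it
systemctl status k3s --no-pager
journalctl -u k3s --no-pager -n 50
systemctl start k3s
```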