
salmon-noon-33588

11/02/2022, 1:00 AM
Hey uh, recently it seems our `cattle-cluster-agent` has lost its mind in our sandbox cluster, to the point where the cluster becomes unusable. Whenever it's running, the CPU usage of all of the api-servers in the cluster is pegged at 100%. I'm not sure if this is requests coming in from Rancher or from the agent itself. There don't seem to be any suspicious logs except for the persistent:

```
error syncing 'system-library': handler system-image-upgrade-catalog-controller: upgrade cluster {cluster} system service alerting failing: template system-library-rancher-monitoring incompatible with rancher version or cluster's [{cluster}] kubernetes version, requeueing
```

Oddly, we get this message for the cluster whether or not its agent is running. After a little while, we start seeing errors that seem related to the API server just being too busy, things like:

• `Unexpected error when reading response body: context canceled`
• `"Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168 (02-Nov-2022 00:36:47.268) (total time: 29608ms)...`

And others. Is it possible to determine what the agent is doing that's causing this? I'm wondering if these nodes are a tad underprovisioned at 2 cores and 8 GB of RAM? That seems weird, though; they've been fine for a few years now. Also: Rancher v2.6.7 and Kubernetes v1.22.11.
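One way to narrow down what's hammering the API server (a sketch, assuming you can reach the apiserver metrics endpoint with cluster-admin; `apiserver_request_total` won't name the client, but a runaway LIST/WATCH pattern on a specific resource usually stands out):

```bash
# Dump kube-apiserver request counters and show the hottest verb/resource
# combinations. The value after '}' is a cumulative count, so run this twice
# a minute apart and compare to see which counters are actually growing.
kubectl get --raw /metrics \
  | grep '^apiserver_request_total' \
  | sort -t'}' -k2 -rn \
  | head -20
```

For true per-client attribution you'd need apiserver audit logging, which is heavier to set up.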
Looks like shutting down the Rancher server also causes a CPU usage drop, so Rancher is probably doing something that's blowing up that cluster.
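(For anyone trying the same experiment: a sketch, assuming a standard Helm install of Rancher in the `cattle-system` namespace; adjust the replica count to whatever your install normally runs.)

```bash
# Temporarily stop the Rancher server to confirm it's the source of the load.
kubectl -n cattle-system scale deploy/rancher --replicas=0
# ...watch apiserver CPU on the downstream cluster, then restore:
kubectl -n cattle-system scale deploy/rancher --replicas=3
```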
So, finally discovered what was up. Turns out I had somehow put a `User` resource in the Rancher cluster into a weird state when doing the 2.6.7 Azure AD API transition. The user was being infinitely refreshed but never completing because, I guess, it had no `UserAttribute` resource. Nuking that user from the Rancher UI immediately quieted down all clusters.
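In case anyone needs to hunt for the same thing: a rough sketch, assuming kubectl access to the Rancher local cluster. Rancher stores these as cluster-scoped `users.management.cattle.io` and `userattributes.management.cattle.io` objects, and a UserAttribute normally shares its User's name (u-xxxxx):

```bash
# Flag Users in the Rancher local cluster with no matching UserAttribute.
# A hit isn't proof of corruption (users who never logged in may lack one),
# but it narrows down candidates for the broken state described above.
for u in $(kubectl get users.management.cattle.io -o name | cut -d/ -f2); do
  kubectl get userattributes.management.cattle.io "$u" >/dev/null 2>&1 \
    || echo "User $u has no UserAttribute"
done
```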