11/02/2022, 1:00 AM
Hey uh, recently it seems our cluster agent has lost its mind in our sandbox cluster, to the point where the cluster becomes unusable. Whenever the agent is running, the CPU usage of all of the API servers in the cluster is pegged at 100%. I'm not sure if this is requests coming in from Rancher or from the agent itself. There don't seem to be any suspicious logs except for this persistent error:
error syncing 'system-library': handler system-image-upgrade-catalog-controller: upgrade cluster {cluster} system service alerting failing: template system-library-rancher-monitoring incompatible with rancher version or cluster's [{cluster}] kubernetes version, requeueing
Oddly, we get this message for the cluster whether or not its agent is running. After a little bit, we start seeing errors that seem related to the API server just being too busy, things like:
• Unexpected error when reading response body: context canceled
• "Reflector ListAndWatch" name:pkg/mod/ (02-Nov-2022 00:36:47.268) (total time: 29608ms)...
And others. Is it possible to determine what the agent is doing that's causing this? I'm wondering if these nodes are a tad underprovisioned at 2 cores and 8 GB of RAM, but that seems unlikely; they've been fine for a few years now. For reference: Rancher v2.6.7 and Kubernetes v1.22.11.
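In case it helps anyone poking at the same thing, this is roughly how I've been trying to see which clients/verbs are generating the load, by sampling the apiserver's standard apiserver_request_total counters. A rough sketch using the Python Kubernetes client, not exact tooling; it assumes kubeconfig access to the busy downstream cluster:

```python
# Sample the apiserver's request counters to see which request series
# dominate (same data as `kubectl get --raw /metrics`).
from collections import Counter

from kubernetes import client, config

config.load_kube_config()  # kubeconfig pointed at the busy downstream cluster
api = client.ApiClient()

# Fetch the raw Prometheus text; _preload_content=False keeps the raw response.
resp = api.call_api(
    "/metrics", "GET",
    auth_settings=["BearerToken"],
    _preload_content=False,
)
raw = resp[0].data.decode("utf-8")

# Tally apiserver_request_total by its label set (verb, resource, code, ...).
counts = Counter()
prefix = "apiserver_request_total"
for line in raw.splitlines():
    if not line.startswith(prefix + "{"):
        continue
    labels, value = line[len(prefix):].rsplit(" ", 1)
    counts[labels] += float(value)

# These counters are cumulative, so run this twice a minute apart and diff
# the totals to get an actual request rate per series.
for labels, total in counts.most_common(10):
    print(int(total), labels)
```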
Looks like shutting down the Rancher server also causes a CPU usage drop, so Rancher is probably doing something that's blowing up that cluster.
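For anyone wanting to run the same comparison on the agent side, a minimal sketch, assuming the agent runs as the standard cattle-cluster-agent deployment in cattle-system (it did for us):

```python
# Toggle the downstream Rancher agent off/on and watch whether the
# apiserver CPU follows it.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig pointed at the downstream cluster
apps = client.AppsV1Api()

def scale_agent(replicas: int) -> None:
    apps.patch_namespaced_deployment_scale(
        name="cattle-cluster-agent",
        namespace="cattle-system",
        body={"spec": {"replicas": replicas}},
    )

scale_agent(0)  # agent off; watch apiserver CPU for a few minutes
# ...
scale_agent(1)  # restore the agent
```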
So, finally discovered what was up. Turns out I had somehow put a User resource in the Rancher cluster into a weird state when doing the 2.6.7 Azure API transition. The user was being refreshed in an infinite loop, and the refresh never completed because, I guess, it had no UserAttribute resource. Nuking that user from the Rancher UI immediately quieted down all clusters.
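If anyone else suspects the same broken state, here's a quick sketch of how you could audit for Users that have no matching UserAttribute. It runs against the Rancher management ("local") cluster and assumes the usual management.cattle.io/v3 CRDs, where UserAttribute names normally mirror User names; the kubeconfig context name is just whatever yours is called:

```python
# Look for Rancher Users with no matching UserAttribute in the
# management ("local") cluster.
from kubernetes import client, config

config.load_kube_config(context="local")  # context for the Rancher management cluster (assumed name)
crd = client.CustomObjectsApi()

users = crd.list_cluster_custom_object("management.cattle.io", "v3", "users")
attrs = crd.list_cluster_custom_object("management.cattle.io", "v3", "userattributes")

# UserAttribute objects are normally named after their User (e.g. u-abc123).
attr_names = {a["metadata"]["name"] for a in attrs["items"]}

for user in users["items"]:
    name = user["metadata"]["name"]
    if name not in attr_names:
        # `username` is a top-level field on Rancher's User resource.
        print("User with no UserAttribute:", name, user.get("username", ""))
```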