salmon-noon-33588
11/02/2022, 1:00 AM
cattle-cluster-agent has lost its mind in our sandbox cluster to the point where the cluster becomes unusable.
Whenever it's running, the CPU usage of all of the API servers in the cluster is pegged at 100%. I'm not sure if this is caused by requests coming in from Rancher or from the agent. There don't seem to be any suspicious logs except for this persistent one:
error syncing 'system-library': handler system-image-upgrade-catalog-controller: upgrade cluster {cluster} system service alerting failing: template system-library-rancher-monitoring incompatible with rancher version or cluster's [{cluster}] kubernetes version, requeueing
Oddly, we get this message for the cluster whether or not its agent is running.
After a little bit, we start seeing errors that seem related to the API server just being too busy, things like:
• Unexpected error when reading response body: context canceled
• "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168 (02-Nov-2022 00:36:47.268) (total time: 29608ms)...
And others. Is it possible to determine what the agent is doing that's causing this? I'm wondering if these nodes are a tad underprovisioned at 2 cores and 8GB of RAM? That seems weird though, they've been fine for a few years now. Also:
Rancher v2.6.7 and Kubernetes v1.22.11.
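One way I could imagine narrowing this down, assuming audit logging is enabled on the API servers and written as JSON lines (it may well not be on this cluster), is to tally audit events by client and see who is generating the load. This is only a rough sketch: the function name top_api_clients is made up for illustration, and the fields it reads (stage, userAgent, verb, objectRef.resource) are the ones from the audit.k8s.io/v1 Event schema.

```python
import json
from collections import Counter


def top_api_clients(audit_lines, n=5, stage="ResponseComplete"):
    """Tally audit events per (userAgent, verb, resource).

    Only events in the given stage are counted, so a request that is
    logged at several stages is not counted more than once.
    Returns the n most frequent combinations with their counts.
    """
    counts = Counter()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict) or event.get("stage") != stage:
            continue
        agent = event.get("userAgent", "unknown")
        verb = event.get("verb", "unknown")
        # objectRef is absent for non-resource requests such as /healthz
        resource = (event.get("objectRef") or {}).get("resource", "unknown")
        counts[(agent, verb, resource)] += 1
    return counts.most_common(n)
```

Feeding it the lines of an audit log file (for example, the result of open() on wherever the API server writes its audit log) should show whether the heaviest client is the agent, Rancher itself, or something else, and which verb and resource it keeps requesting.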