# general
Hello Rancher friends! 👋 We're running into an issue with one of our Rancher clusters and are looking for some advice.

We recently upgraded one of our Rancher clusters from RKE2 `1.31.3` to `1.31.7`. The upgrade seemed to go fine, and the Rancher cluster and its two child clusters appear to be working correctly - no problems with any workloads. That said, if we log into Rancher and click on the name of one of the child clusters (e.g. `dev`), we receive an `Error: Unknown schema for type: namespace` error. Although not exactly the same, the error looks similar to the image below (I borrowed this image from a different GitHub issue). We're not quite sure why this is happening, and it doesn't seem like something an RKE2 upgrade should trigger.

Looking into it a bit further, if I check the logs of any of the `cattle-cluster-agent` pods on the child cluster, there are a lot of errors of the form:
```
level=error msg="failed to sync schemas: the server is currently unable to handle the request"
```
I've tried increasing the log level on the `cattle-cluster-agent` deployment by setting the `CATTLE_TRACE` environment variable to `true`, but this hasn't yielded any additional relevant logging information. These clusters are running in an airgapped environment, so I can't readily share any logs in their entirety.
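For reference, this is roughly what we've been running against the child cluster to pull the agent logs and set that variable (a sketch - it assumes the agent deployment sits in the usual `cattle-system` namespace under the name `cattle-cluster-agent`):

```bash
# Tail recent logs from the cluster agent (deployment name/namespace assumed)
kubectl -n cattle-system logs deployment/cattle-cluster-agent --tail=200

# Raise the agent's log verbosity via CATTLE_TRACE; this rolls the agent pods
kubectl -n cattle-system set env deployment/cattle-cluster-agent CATTLE_TRACE=true
kubectl -n cattle-system rollout status deployment/cattle-cluster-agent
```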
Looking in the web browser's logs, I can see that when we click on one of these clusters, the UI sends a request to `hxxps://rancher.ourdomain.com/k8s/clusters/clusterid-goes-here/v1/schema`. Comparing the results of this request between a known-working cluster and the currently misbehaving cluster, the data returned for the misbehaving cluster is a lot smaller. On the working cluster, the results contain descriptions of virtually every resource type in the cluster; on the non-working cluster, lots of types are missing, including built-in Kubernetes types such as `Namespace`, which seems to explain the UI error in the screenshot.
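In case it helps anyone reproduce the comparison, this is roughly how we diffed the two responses (a sketch - the API token, the cluster IDs, and the assumption that the `/v1/schema` response is a collection whose `data[]` entries each carry a schema `id` are all ours):

```bash
# Compare the schema type IDs Rancher returns for a working vs. a misbehaving cluster.
# RANCHER_TOKEN and the cluster IDs below are placeholders.
RANCHER=https://rancher.ourdomain.com
for c in working-cluster-id broken-cluster-id; do
  curl -sk -H "Authorization: Bearer $RANCHER_TOKEN" \
    "$RANCHER/k8s/clusters/$c/v1/schema" \
    | jq -r '.data[].id' | sort > "$c.txt"
done
diff working-cluster-id.txt broken-cluster-id.txt
```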
As an additional data point, we've tried creating a brand new cluster in Rancher after performing the upgrade (our cleverly named `debug` cluster), and it works perfectly fine - no errors in the `cattle-cluster-agent` on the new cluster at all. Suffice it to say, Rancher seems to be working fine in general - it just has trouble communicating with the existing clusters for some reason.

So - has anyone else experienced anything like this, and do you have suggestions for additional troubleshooting/debugging? Is there an easy way to tell Rancher to "re-onboard" an existing cluster, to make sure its `cattle-cluster-agent` deployment is set up correctly to talk to Rancher?

Environment:
• Rancher `2.10.3`
• RKE2 `1.31.7` on Ubuntu `22.04`
• vSphere 7 on-prem