11/28/2022, 1:57 PM
Hi all, I'm having issues with metrics and I can't find anything helpful online. I have a cluster with 3 etcd + control plane nodes and 4 workers. Docker 20.10.21, Kubernetes 1.24.6, Rancher 2.6.9. Everything was fine in the beginning, but after a few days, one after the other, nodes stopped reporting metrics. Some are still working, but I'm sure that if I leave them long enough, they will stop too. The metrics server reports:
"Failed to scrape node" err="Get \"https://***:10250/metrics/resource\": context deadline exceeded" node="node-that-stopped-reporting"
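(For context, this is how I'm pulling those errors — assuming metrics-server runs as a Deployment named metrics-server in kube-system, which is the default in my setup; adjust the names if yours differ:)

```shell
# Recent scrape failures from metrics-server
# (assumes the default Deployment name/namespace)
kubectl -n kube-system logs deploy/metrics-server --since=1h \
  | grep "Failed to scrape node"
```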
kubectl top node shows:
node-that-stopped-working <unknown> <unknown> <unknown> <unknown>
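For completeness, the Resource Metrics API can also be queried directly to see which nodes still have data (assuming the usual metrics.k8s.io/v1beta1 group served by metrics-server) — the nodes showing <unknown> above are simply missing from the response:

```shell
# List nodes the metrics API currently has data for
# (metrics.k8s.io/v1beta1 is the standard metrics-server API group)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" \
  | python3 -c 'import json,sys; [print(i["metadata"]["name"]) for i in json.load(sys.stdin)["items"]]'
```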
HPAs are not working because of this, and I'm not sure what other issues it could cause. If I restart the node, the reporting starts working again for a few days, and then it stops. The machines are powerful enough, and this problem started when the cluster was still empty, so I doubt it's resource-related. Even now that the cluster is running some workloads, there are plenty of resources available. I have no idea what could cause this, but I noticed something interesting:
kubectl get --raw /api/v1/nodes/$NODE_NAME/proxy/stats/summary
takes a really long time (~2 minutes) on those nodes that don't work, while it's fast (~2 seconds) on a working node. Any kind of help is greatly appreciated!
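PS: to compare the nodes side by side, a loop like this (nothing fancy, just time around the same raw call) makes the slow ones obvious:

```shell
# Time the kubelet stats/summary call for every node in the cluster;
# affected nodes take minutes, healthy ones a couple of seconds.
for NODE in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== $NODE ==="
  time kubectl get --raw "/api/v1/nodes/$NODE/proxy/stats/summary" > /dev/null
done
```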