# general
Hey all wranglers 🐄 I have an intermittent issue with my downstream clusters that I hope to get some insight into. 🙂 I have a Rancher instance deployed on a k3s cluster with about 35 downstream clusters connected to it. These clusters are at edge locations that sometimes lose network connectivity and therefore may lose their connection to Rancher. This is of course expected, and most of the time the cluster reconnects to Rancher once connectivity is restored. However, sometimes a cluster just fails to reconnect, and the only fix I've identified is to manually delete the cattle-cluster-agent pod on the downstream cluster, at which point it reconnects. The agent pod logs show some timeout errors to the upstream Rancher instance, but at some point the agent seems to just "give up" reconnecting.
time="2025-04-27T21:24:42Z" level=warning msg="[850] encountered error \"write tcp 10.239.24.70:54512->172.30.103.51:443: i/o timeout\" while writing error \"tunnel disconnect\" to close remotedialer"
time="2025-04-27T21:24:42Z" level=warning msg="[850] encountered error \"write tcp 10.239.24.70:54512->172.30.103.51:443: i/o timeout\" while writing error \"io: read/write on closed pipe\" to close remotedialer"
time="2025-04-27T21:24:42Z" level=error msg="Failed to dial steve aggregation server: read tcp 10.239.24.70:54512->172.30.103.51:443: i/o timeout"
time="2025-04-27T21:34:00Z" level=error msg="Failed to dial steve aggregation server: read tcp 10.239.24.70:52390->172.30.103.51:443: i/o timeout"
time="2025-04-27T21:34:10Z" level=error msg="Failed to dial steve aggregation server: dial tcp 172.30.103.51:443: i/o timeout"
time="2025-04-27T21:39:47Z" level=error msg="Failed to dial steve aggregation server: read tcp 10.239.24.70:42104->172.30.103.51:443: i/o timeout"
time="2025-04-27T22:23:04Z" level=info msg="Downloading repo index from <https://azure.github.io/secrets-store-csi-driver-provider-azure/charts/index.yaml>"
time="2025-04-27T22:26:42Z" level=error msg="Failed to dial steve aggregation server: read tcp 10.239.24.70:34800->172.30.103.51:443: i/o timeout"
time="2025-04-27T22:26:52Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup our.rancher.hostname: i/o timeout"
time="2025-04-27T22:27:02Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup our.rancher.hostname: i/o timeout"
time="2025-04-27T22:27:12Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup our.rancher.hostname: i/o timeout"
time="2025-04-27T22:27:22Z" level=error msg="Failed to dial steve aggregation server: dial tcp 172.30.103.51:443: i/o timeout"
time="2025-04-27T22:27:32Z" level=error msg="Failed to dial steve aggregation server: dial tcp 172.30.103.51:443: i/o timeout"
Any pointers on how to resolve this? Is there some value I can tweak to increase the timeout or the number of retry attempts, perhaps? Thanks 🙏
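Until the root cause is known, the manual fix could probably be automated as a stopgap. Below is a minimal sketch, not a tested solution: it parses agent log lines in the format pasted above and flags the agent as possibly stuck when several "Failed to dial steve aggregation server" errors fall inside a recent time window. The names `parse_line`, `looks_stuck`, and `restart_agent`, the 10-minute window, and the threshold of 3 are all hypothetical choices for illustration. The `cattle-system` namespace and the `app=cattle-cluster-agent` label in the `kubectl` call are assumptions that should be checked against the actual cluster.

```python
import re
import subprocess
from datetime import datetime, timedelta, timezone

# Matches the log format shown above: time="..." level=... msg="..."
LINE_RE = re.compile(r'time="(?P<ts>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>.*)"$')
DIAL_FAILURE = "Failed to dial steve aggregation server"


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%SZ")
    return ts.replace(tzinfo=timezone.utc), m.group("level"), m.group("msg")


def looks_stuck(lines, now, window=timedelta(minutes=10), threshold=3):
    """True if at least `threshold` dial failures were logged within `window` before `now`."""
    failures = 0
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _level, msg = parsed
        if DIAL_FAILURE in msg and now - window <= ts <= now:
            failures += 1
    return failures >= threshold


def restart_agent():
    """Delete the agent pod so its Deployment recreates it.

    Assumption: namespace and label selector match this cluster's agent.
    """
    subprocess.run(
        ["kubectl", "-n", "cattle-system", "delete", "pod",
         "-l", "app=cattle-cluster-agent"],
        check=True,
    )
```

A caller would feed `looks_stuck` the most recent agent log lines and the current UTC time, and call `restart_agent()` only when it returns `True`. Note that this heuristic cannot tell a genuinely stuck agent from an ongoing network outage, so it would need to be paired with a separate check that Rancher is actually reachable from the cluster.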