Hey all wranglers cow2 Have an intermittent issue with my do Rancher Users #general

billowy-apple-60989

04/28/2025, 1:48 PM

Hey all wranglers 🐄 I have an intermittent issue with my downstream clusters that I hope to get some insight into. 🙂 I have a Rancher instance deployed on a k3s cluster with about 35 downstream clusters connected to it. These clusters are at edge locations that sometimes lose network connectivity and therefore may lose their connection to Rancher. This is of course expected, and most of the time the cluster reconnects to Rancher once connectivity is restored. However, sometimes a cluster just fails to reconnect, and the only fix I've identified is to manually delete the `cattle-cluster-agent` pod on the downstream cluster, at which point it reconnects. The agent pod logs show some timeout errors to the upstream Rancher instance, but the agent seems to just "give up" reconnecting at some point.

time="2025-04-27T21:24:42Z" level=warning msg="[850] encountered error \"write tcp 10.239.24.70:54512->172.30.103.51:443: i/o timeout\" while writing error \"tunnel disconnect\" to close remotedialer"
time="2025-04-27T21:24:42Z" level=warning msg="[850] encountered error \"write tcp 10.239.24.70:54512->172.30.103.51:443: i/o timeout\" while writing error \"io: read/write on closed pipe\" to close remotedialer"
time="2025-04-27T21:24:42Z" level=error msg="Failed to dial steve aggregation server: read tcp 10.239.24.70:54512->172.30.103.51:443: i/o timeout"
time="2025-04-27T21:34:00Z" level=error msg="Failed to dial steve aggregation server: read tcp 10.239.24.70:52390->172.30.103.51:443: i/o timeout"
time="2025-04-27T21:34:10Z" level=error msg="Failed to dial steve aggregation server: dial tcp 172.30.103.51:443: i/o timeout"
time="2025-04-27T21:39:47Z" level=error msg="Failed to dial steve aggregation server: read tcp 10.239.24.70:42104->172.30.103.51:443: i/o timeout"
time="2025-04-27T22:23:04Z" level=info msg="Downloading repo index from <https://azure.github.io/secrets-store-csi-driver-provider-azure/charts/index.yaml>"
time="2025-04-27T22:26:42Z" level=error msg="Failed to dial steve aggregation server: read tcp 10.239.24.70:34800->172.30.103.51:443: i/o timeout"
time="2025-04-27T22:26:52Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup our.rancher.hostname: i/o timeout"
time="2025-04-27T22:27:02Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup our.rancher.hostname: i/o timeout"
time="2025-04-27T22:27:12Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup our.rancher.hostname: i/o timeout"
time="2025-04-27T22:27:22Z" level=error msg="Failed to dial steve aggregation server: dial tcp 172.30.103.51:443: i/o timeout"
time="2025-04-27T22:27:32Z" level=error msg="Failed to dial steve aggregation server: dial tcp 172.30.103.51:443: i/o timeout"
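
To make the pattern in these logs easier to see, here is a minimal Python sketch that groups the timeout lines by the stage at which they fail: DNS lookup, TCP connect, or an already established connection. The category names and the `classify` / `summarize` helpers are my own, not anything Rancher provides, and the sketch only assumes the `time=... level=... msg="..."` layout shown above.

```python
import re
from collections import Counter

# Assumed layout, taken from the agent log lines above:
#   time="<timestamp>" level=<level> msg="<message>"
LOG_LINE = re.compile(r'time="(?P<ts>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>.*)"$')


def classify(msg: str) -> str:
    """Return a hypothetical category name for one timeout message."""
    if "lookup " in msg:
        return "dns-timeout"          # name resolution never completed
    if "dial tcp" in msg:
        return "connect-timeout"      # TCP connection could not be opened
    if "read tcp" in msg or "write tcp" in msg:
        return "established-timeout"  # an open connection stopped responding
    return "other"


def summarize(lines):
    """Count i/o timeout messages per category, skipping unrelated lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match or "i/o timeout" not in match.group("msg"):
            continue
        counts[classify(match.group("msg"))] += 1
    return counts
```

Run against the lines above, all three categories show up.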

Any pointers on how to resolve this? Is there some value I can tweak to increase the timeout or the number of retry attempts, perhaps? Thanks 🙏
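
For reference, the manual workaround can be scripted roughly like this. It is only a sketch of the stopgap, not a fix: it assumes the agent runs in the `cattle-system` namespace with the label `app=cattle-cluster-agent` (worth checking on your own clusters), that `kubectl` is on the PATH, and the function names and the `edge-01` context are made up for illustration.

```python
import subprocess

# Assumptions, not confirmed in this thread: verify both on your clusters.
NAMESPACE = "cattle-system"
SELECTOR = "app=cattle-cluster-agent"


def build_delete_command(context=None, namespace=NAMESPACE, selector=SELECTOR):
    """Build the kubectl argv that deletes the agent pod(s)."""
    cmd = ["kubectl"]
    if context:
        # Target one downstream cluster by its kubeconfig context.
        cmd += ["--context", context]
    cmd += ["--namespace", namespace, "delete", "pod", "--selector", selector]
    return cmd


def restart_agent(context=None, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    cmd = build_delete_command(context=context)
    print(" ".join(cmd))
    if not dry_run:
        # The Deployment recreates the deleted pod, which reconnects to Rancher.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    restart_agent(context="edge-01")  # hypothetical context name, dry run only
```

The dry-run default is deliberate, so nothing gets deleted until the namespace and label have been confirmed.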
