# rke2
f
Hello, has anyone experienced this issue on RKE2 version v1.33.0+rke2r1? Getting logs from pods on workers returns the following error. Restarting the rke2 service on the node resolves this issue temporarily.
```
Get "https://49.13.68.75:10250/containerLogs/airflow/dim-paypal-checked-list-ta5tu5lt/base?follow=true&timestamps=true": proxy error from 127.0.0.1:9345 while dialing 49.13.68.75:10250, code 502: 502 Bad Gateway","code":500
```
a
I have the exact same issue on one of my clusters. Still need to investigate this further though. If you find any more leads on the root cause, please share 😄
n
isn't something altering iptables rules?
a
iptables is disabled at my end. It seems some internal tunnel breaks without being noticed.
f
But it seems that the rke2 proxy deployed on the master nodes is what fails, right?
yep, no iptables on my end either
n
and is there any log from the proxy about why it is failing?
f
Can't find it. By proxy I mean the service that runs on port 9345.
n
but kube-proxy pod has no error?
but actually, 10250 is kubelet. no errors in /var/lib/rancher/rke2/agent/kubelet.log ?
f
We don't have kube-proxy, we're using cilium with ebpf
n
are the nodes L2 reachable, or are you using VXLANs?
f
l2
n
do you have direct routes?
c
This indicates that the websocket tunnel from the agent to the server is getting disconnected and cannot reconnect. Check the logs on the agent for messages about remotedialer/websocket
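A quick way to pull those out of the agent journal; a sketch, assuming the default rke2-agent unit name:
```bash
# Websocket/remotedialer tunnel activity from the last day on the agent
sudo journalctl -u rke2-agent.service --since "24 hours ago" \
  | grep -iE "remotedialer|websocket|connecting to proxy"
```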
a
In my case, it's a 10Gbps, very low-latency network. Canal for networking using host-gateway, so direct L2 connectivity. Even debug log level didn't bring me closer to finding an exact indication in the logs. Research indeed indicated that the websocket tunnel goes down, but I would expect some sort of health check on this tunnel to restart it in case it goes down (or at least more logging about this problem).
c
It does have a health check, and you should see errors from it trying to reconnect every 30 seconds or so.
Can you confirm that all of your servers have distinct IPs? Show the output of
`kubectl get node -o wide`
and
`kubectl get endpoints -n default kubernetes -o yaml`
There should be a unique endpoint address listed for the Kubernetes service for each of your servers.
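If it's easier to eyeball, something like this prints just the endpoint IPs (a hypothetical convenience one-liner over the same object as above):
```bash
# Each server should show up exactly once
kubectl get endpoints -n default kubernetes \
  -o jsonpath='{.subsets[*].addresses[*].ip}{"\n"}'
```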
a
I have checked and all my servers have unique IPs. (I would also expect a lot more issues if they weren't unique.) At this time, I don't have an installation where the tunnel is broken, but at the last occurrence it was broken for two weeks before it was noticed.
c
provide logs, when you can.
also, info on what version you’re on - since you’re replying to someone else’s thread
f
I think this issue is present on all of the versions, though
Will provide logs from kubelet
c
there have been many improvements to the agent tunnel and health checking over the last year. I am not aware of any issues in the latest releases.
not kubelet logs. rke2-agent logs from journald.
and rke2-server logs from all servers, for the same timeframe
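Something like this captures both sides for a matching window; a sketch, adjust the time range to the incident:
```bash
# On the affected worker
sudo journalctl -u rke2-agent.service --since "2 hours ago" > "rke2-agent-$(hostname -s).log"

# On each server, same window
sudo journalctl -u rke2-server.service --since "2 hours ago" > "rke2-server-$(hostname -s).log"
```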
f
```
k logs airflow-worker-3
Defaulted container "airflow-worker" out of: airflow-worker, dags-git-sync, log-cleanup, dags-git-clone (init), check-db (init), wait-for-db-migrations (init)
Error from server: Get "https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker": proxy error from 127.0.0.1:9345 while dialing 49.13.68.75:10250, code 502: 502 Bad Gateway
```
agent
```
root@worker04:~# journalctl -u rke2-agent.service -f
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:17:17Z" level=info msg="Server 23.88.46.2:6443@PREFERRED->HEALTHY from successful health check"
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:04Z" level=info msg="Server 128.140.92.225:6443@RECOVERING->PREFERRED from successful health check"
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 49.13.13.59:6443@FAILED->RECOVERING from successful health check"
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 128.140.92.225:6443@PREFERRED->FAILED from failed dial"
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 23.88.46.2:6443@HEALTHY->ACTIVE from successful dial"
Nov 02 22:19:00 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:59Z" level=info msg="Server 49.13.13.59:6443@RECOVERING->PREFERRED from successful health check"
Nov 02 22:19:00 worker04.blabla rke2[1834650]: time="2025-11-02T22:19:00Z" level=info msg="Server 128.140.92.225:6443@FAILED->RECOVERING from successful health check"
Nov 02 22:19:01 worker04.blabla rke2[1834650]: time="2025-11-02T22:19:01Z" level=info msg="Server 128.140.92.225:6443@RECOVERING->PREFERRED from successful health check"
Nov 02 22:20:00 worker04.blabla rke2[1834650]: time="2025-11-02T22:20:00Z" level=info msg="Server 49.13.13.59:6443@PREFERRED->HEALTHY from successful health check"
Nov 02 22:20:01 worker04.blabla rke2[1834650]: time="2025-11-02T22:20:01Z" level=info msg="Server 128.140.92.225:6443@PREFERRED->HEALTHY from successful health check"
```
server
```
Nov 04 09:07:54 master01.blabla rke2[788072]: time="2025-11-04T09:07:54Z" level=error msg="Sending HTTP/1.1 502 response to 127.0.0.1:57440: failed to find Session for client worker04.blabla"
```
c
which one of those IPs is master01, and what is the most recent health-check result for master01?
it might be more useful to search for master01's IP in the agent logs
f
```
journalctl -u rke2-agent.service -f | grep "49.13.13.59"
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 49.13.13.59:6443@FAILED->RECOVERING from successful health check"
Nov 02 22:19:00 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:59Z" level=info msg="Server 49.13.13.59:6443@RECOVERING->PREFERRED from successful health check"
Nov 02 22:20:00 worker04.blabla rke2[1834650]: time="2025-11-02T22:20:00Z" level=info msg="Server 49.13.13.59:6443@PREFERRED->HEALTHY from successful health check"
```
c
There will also be a bunch of “connecting to proxy” and “started tunnel to” messages. They will likely have the hostname (master01) instead of the IP, if you used a hostname as the server URL.
Something like this:
```
INFO[0000] Updated load balancer rke2-agent-load-balancer default server: 172.17.0.4:9345
INFO[0000] Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [] [default: 172.17.0.4:9345]
INFO[0000] Updated load balancer rke2-api-server-agent-load-balancer default server: 172.17.0.4:6443
INFO[0000] Running load balancer rke2-api-server-agent-load-balancer 127.0.0.1:6443 -> [] [default: 172.17.0.4:6443]
INFO[0010] Got apiserver addresses from supervisor: [172.17.0.4:6443]
INFO[0010] Server 172.17.0.4:6443@STANDBY*->UNCHECKED from add to load balancer rke2-api-server-agent-load-balancer
INFO[0010] Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [172.17.0.4:6443] [default: 172.17.0.4:6443]
INFO[0010] Server 172.17.0.4:9345@STANDBY*->UNCHECKED from add to load balancer rke2-agent-load-balancer
INFO[0010] Updated load balancer rke2-agent-load-balancer server addresses -> [172.17.0.4:9345] [default: 172.17.0.4:9345]
INFO[0010] Connecting to proxy                           url="wss://172.17.0.4:9345/v1-rke2/connect"
INFO[0010] Server 172.17.0.4:9345@UNCHECKED*->RECOVERING from successful dial
INFO[0010] Remotedialer connected to proxy               url="wss://172.17.0.4:9345/v1-rke2/connect"
INFO[0010] Server 172.17.0.4:6443@UNCHECKED*->RECOVERING from successful health check
INFO[0011] Server 172.17.0.4:9345@RECOVERING*->ACTIVE from successful health check
INFO[0011] Server 172.17.0.4:6443@RECOVERING*->ACTIVE from successful health check
```
I do note that agent logs are only showing errors for 6443 (apiserver), not 9345 (supervisor). Is there some problem with the apiserver pod on master01? Have you checked the apiserver pod logs?
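If it helps, a sketch for pulling the apiserver logs on master01; the pod name follows the usual component-nodename static-pod convention, so double-check it with `kubectl -n kube-system get pods`:
```bash
# Via the API
kubectl -n kube-system logs kube-apiserver-master01.blabla --since=1h

# Or directly on master01, bypassing the API entirely
sudo sh -c 'tail -n 200 /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log'
```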
f
hm, hostname doesn't show anything
```
root@worker04:~# journalctl -u rke2-agent.service | grep "master01"
root@worker04:~#
```
logs from apiserver on master01
```
E1104 08:04:34.228543       1 timeout.go:140] "Post-timeout activity" logger="UnhandledError" timeElapsed="152.977”s" method="GET" path="/api/v1/namespaces/airbyte/pods/replication-job-230486-attempt-0" result=null
I1104 08:09:49.334529       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
E1104 08:12:48.656379       1 wrap.go:53] "Timeout or abort while handling" logger="UnhandledError" method="GET" URI="/api/v1/namespaces/airbyte/pods/replication-job-230484-attempt-0" auditID="92cf135f-a399-4ad4-ab98-2b357b121f28"
E1104 08:12:48.656452       1 timeout.go:140] "Post-timeout activity" logger="UnhandledError" timeElapsed="5.48”s" method="GET" path="/api/v1/namespaces/airbyte/pods/replication-job-230484-attempt-0" result=null
I1104 08:19:49.334658       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
I1104 08:29:49.335030       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
I1104 08:39:49.336135       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
I1104 08:49:49.336535       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
I1104 08:59:49.336615       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
E1104 09:06:57.924497       1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &url.Error{Op:\"Get\", URL:\"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\", Err:(*errors.errorString)(0xc05cb532a0)}: Get \"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\": proxy error from 127.0.0.1:9345 while dialing 49.13.68.75:10250, code 502: 502 Bad Gateway" logger="UnhandledError"
E1104 09:07:16.288482       1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &url.Error{Op:\"Get\", URL:\"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\", Err:(*errors.errorString)(0xc04b4c2c80)}: Get \"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\": proxy error from 127.0.0.1:9345 while dialing 49.13.68.75:10250, code 502: 502 Bad Gateway" logger="UnhandledError"
E1104 09:07:54.410252       1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &url.Error{Op:\"Get\", URL:\"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\", Err:(*errors.errorString)(0xc054191f60)}: Get \"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\": proxy error from 127.0.0.1:9345 while dialing 49.13.68.75:10250, code 502: 502 Bad Gateway" logger="UnhandledError"
I1104 09:09:49.336711       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
I1104 09:19:49.337428       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
```
The apiserver is functioning; I can perform kubectl operations on resources and whatnot. Just getting logs from this specific worker is the issue.
c
What was going on with the apiserver around when the agent marked it as failing a dial here:
```
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 128.140.92.225:6443@PREFERRED->FAILED from failed dial"
```
failed dial means it tried to connect, but was unable to do so. So it was down or unreachable for some reason.
Is this the only server in the cluster, or are there others?
f
there are 3 servers in the cluster
```
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 128.140.92.225:6443@PREFERRED->FAILED from failed dial"
```
there are no logs for this on the server node
at this specific time
c
on the server side you should see a message like this whenever an agent connects to the remotedialer proxy:
```
INFO[0137] Handling backend connection request [rke2-agent-001.example.com]
```
and then another message like this when it disconnects (unfortunately it doesn’t say who is disconnecting):
```
INFO[1034] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
```
There are no logs for that time period at all on the server side? Was there some other network issue going on that caused it to fail to connect?
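A sketch for narrowing the server journals down to tunnel-related lines for that window (the date range and worker name are just the ones from the logs above):
```bash
# Run on each server
sudo journalctl -u rke2-server.service --since "2025-11-02" --until "2025-11-03" \
  | grep -E "Handling backend connection|remotedialer|worker04"
```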
f
```
Nov 02 02:14:02 master02.blabla rke2[691565]: time="2025-11-02T02:14:02Z" level=info msg="error in remotedialer server [400]: read tcp 128.140.92.225:9345->128.140.14.189:3028: i/o timeout"
Nov 03 14:36:37 master02.blabla rke2[691565]: time="2025-11-03T14:36:37Z" level=info msg="error in remotedialer server [400]: read tcp 128.140.92.225:9345->162.55.100.234:3204: i/o timeout"
Nov 03 14:44:55 master02.blabla rke2[691565]: time="2025-11-03T14:44:55Z" level=info msg="error in remotedialer server [400]: read tcp 128.140.92.225:9345->162.55.100.234:32768: i/o timeout"

Nov 02 02:14:04 master03.blabla rke2[689639]: time="2025-11-02T02:14:04Z" level=info msg="error in remotedialer server [400]: read tcp 23.88.46.2:9345->128.140.14.189:59666: i/o timeout"
Nov 03 14:36:23 master03.blabla rke2[689639]: time="2025-11-03T14:36:23Z" level=info msg="error in remotedialer server [400]: read tcp 23.88.46.2:9345->162.55.100.234:11820: i/o timeout"
Nov 03 14:48:56 master03.blabla rke2[689639]: time="2025-11-03T14:48:56Z" level=info msg="error in remotedialer server [400]: read tcp 23.88.46.2:9345->162.55.100.234:2608: i/o timeout"

Nov 02 02:13:55 master01.blabla rke2[788072]: time="2025-11-02T02:13:55Z" level=info msg="error in remotedialer server [400]: read tcp 49.13.13.59:9345->128.140.14.189:36168: i/o timeout"
Nov 03 14:36:35 master01.blabla rke2[788072]: time="2025-11-03T14:36:35Z" level=info msg="error in remotedialer server [400]: read tcp 49.13.13.59:9345->162.55.100.234:57974: i/o timeout"
Nov 03 14:46:41 master01.blabla rke2[788072]: time="2025-11-03T14:46:41Z" level=info msg="error in remotedialer server [400]: read tcp 49.13.13.59:9345->162.55.100.234:28184: i/o timeout"
```
Regarding a network issue, there could have been one, but it should not have been long.
and as I understand it, those health checks keep retrying? So the intermittent timeout should be resolved, no?
c
the health checks do, yes. However it seems like the agent thinks it still has a connection up - since you did not see any logs on the agent side regarding redialing the websocket proxy
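One way to confirm a half-open tunnel is to compare socket state on both ends; a sketch, assuming 9345 is only used by the supervisor:
```bash
# On the worker: is there still an ESTABLISHED connection to a server's 9345?
sudo ss -tnp state established '( dport = :9345 )'

# On each server: does the matching connection from that worker's IP still exist?
sudo ss -tnp state established '( sport = :9345 )'
```
If the worker still shows the connection but no server does, the agent is holding a dead socket.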
f
So the server hits a timeout for the agent's connection, but the agent for some reason keeps the connection up
And that's where the `failed to find Session for client` error comes from
c
if you are running with `supervisor-metrics: true` on the server, you can check the load balancer health metrics on individual nodes:
```
kubectl get --server https://AGENT:9345 --raw /metrics | grep rke2_loadbalancer
```
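For next time, enabling it is just a config entry plus a restart; a sketch, assuming the default config path:
```bash
# On each server node
echo "supervisor-metrics: true" | sudo tee -a /etc/rancher/rke2/config.yaml
sudo systemctl restart rke2-server.service
```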
f
hm, unfortunately, currently I do not
c
I am not sure what would cause the websocket client to think it’s still connected when the server says it is not. The latest release of RKE2 v1.33 (v1.33.5) uses a newer version of the remotedialer library, where all this logic lives. There haven’t been a lot of code changes, mostly packaging stuff, but it might be worth seeing if you can reproduce on that version.
One of the changes is to bump gorilla/websocket which could I guess be meaningful. https://github.com/gorilla/websocket/compare/v1.5.1...v1.5.3
f
Gotcha, will try to update to v1.33.5
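In case it saves a lookup, a sketch of the pinned upgrade via the install script (double-check the exact release tag before running):
```bash
# Run as root on each node, servers first, then agents
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION="v1.33.5+rke2r1" sh -

# Then restart the matching unit to pick up the new binary
systemctl restart rke2-server.service   # on servers
systemctl restart rke2-agent.service    # on agents
```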
also, maybe this is not an rke2 issue but a server configuration one
I see that you have dealt with similar issue here: https://github.com/rancher/rke2/issues/7952
c
if calico is doing weird things with the network, that could definitely also cause problems
f
right, but we have cilium + ebpf :D
n
btw, you said you have an L2 network, but according to the IP addresses you provided, it seems that you have an L3 network, and then with cilium you must configure vxlan tunnels. Did you do so?
f
Oh, right, server reachability is provided by the datacenter where we rent our servers
```
Nov 04 13:10:55 worker06.blabla rke2[3162564]: time="2025-11-04T13:10:55Z" level=info msg="Tunnel authorizer set Kubelet Port 0.0.0.0:10250"
Nov 04 13:11:46 worker06.blabla rke2[3162564]: time="2025-11-04T13:11:46Z" level=info msg="Server 49.13.13.59:9345@PREFERRED->HEALTHY from successful health check"
Nov 04 13:11:46 worker06.blabla rke2[3162564]: time="2025-11-04T13:11:46Z" level=info msg="Server 128.140.92.225:9345@PREFERRED->HEALTHY from successful health check"
Nov 04 13:11:46 worker06.blabla rke2[3162564]: time="2025-11-04T13:11:46Z" level=info msg="Server 23.88.46.2:6443@PREFERRED->HEALTHY from successful health check"
Nov 04 13:11:46 worker06.blabla rke2[3162564]: time="2025-11-04T13:11:46Z" level=info msg="Server 128.140.92.225:6443@PREFERRED->HEALTHY from successful health check"
Nov 05 06:04:21 worker06.blabla rke2[3162564]: time="2025-11-05T06:04:21Z" level=error msg="Error writing ping" error="write tcp 91.107.197.193:10670->23.88.46.2:9345: i/o timeout"
Nov 05 06:04:21 worker06.blabla rke2[3162564]: time="2025-11-05T06:04:21Z" level=error msg="Error writing ping" error="write tcp 91.107.197.193:10670->23.88.46.2:9345: i/o timeout"
```
managed to replicate the issue on v1.34.1
c
Ok... so what is going on in your environment when this occurs? The I/O error is coming from the low-level Go network stack. The question would be what is causing it, and why the websocket client isn't recovering from it. If we can reproduce it in a dev environment then we can figure out which library needs to be fixed.
n
is the websocket connected all the time? there might be a short timeout for keepalive connections (if a redeploy helps for some time, that may be the case)
f
Currently, my idea is that an OOM that happens on the worker (in one of the containers) causes this connection loss
Will try to reproduce this
n
you should see the OOM in `dmesg` output
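For example, either of these should show the kernel OOM-killer events with timestamps (a sketch):
```bash
# Human-readable timestamps from the kernel ring buffer
sudo dmesg -T | grep -iE "out of memory|oom-killer|killed process"

# Or the same from the journal (current boot)
sudo journalctl -k | grep -iE "out of memory|oom-killer|killed process"
```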
f
yeah, I see that
that's why this idea came up
but I want to emphasise that this OOM does not happen in the rke2-agent service but in a different container
c
Do you have swap enabled? What is the CPU utilization on this node when this occurs? Is this a VM that might be getting throttled or paused under memory pressure?
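Quick checks for those on the worker; a sketch, the PSI files need a reasonably recent kernel:
```bash
swapon --show                                 # empty output means swap is off
free -h                                       # current memory headroom
cat /proc/pressure/memory /proc/pressure/cpu  # PSI stall stats (kernel 4.20+)
```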
f
workers run on bare metal; CPU utilization during the OOM is around 50%
swap is disabled
c
I believe there may be a fix available for this
the engineer working on the similar Rancher issue hasn’t been able to verify the fix yet, as the affected user hasn’t been able to validate it, but it looks good to me. I will try to get the version bump into this cycle. https://github.com/k3s-io/k3s/issues/13149
f
great, that's nice to hear
a
If the fix is this commit in remotedialer, shouldn't it then be fixed in rke2 v1.34.1+rke2r1 by this commit, which updated the dependency version? The deadlock does describe exactly what I notice on my end (running 1.32.4). However, Luke said above that he replicated the issue on v1.34.1.
f
Yeah, on v1.34.1 I did get the error above. But I still cannot replicate it; my first idea was that some other OOM causes this, but I forced an OOM on an application running on the worker and it didn't lose the connection.
a
I managed to replicate the issue with my client, with a high number of pods that stress CPU/network/memory to the limits. But updating to 1.34 is not so easy for us and is only planned for Q1 2026.
f
No test environment?
a
We have enough test clusters, but replicating the issue is a pain.
f
what CNI are you using?
a
I use Calico in host-gw mode. I think this is one of the simplest network setups you can use.
c
We don’t have any branches that include the locking fix. The bump we’re making is: https://github.com/rancher/remotedialer/compare/e6b68fd83a6b...f160aa32568d
a
Oh, I see now. The Go pseudo-version is of the commit just before the fix, so it's indeed not included yet. I see you created issues to have this backported; thanks for that.
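A way to double-check which remotedialer revision a given rke2 binary actually ships; a sketch, assuming the Go toolchain is available somewhere and the binary is at the default install path:
```bash
# Prints the module list embedded in the binary, including the remotedialer pseudo-version
go version -m /usr/local/bin/rke2 | grep remotedialer
```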