# rke2
f
Hello, has anyone experienced this issue on RKE2 version v1.33.0+rke2r1? Getting logs from pods on workers returns the following error. Restarting the rke2 service on the node resolves this issue temporarily.
```
Get "https://49.13.68.75:10250/containerLogs/airflow/dim-paypal-checked-list-ta5tu5lt/base?follow=true&timestamps=true": proxy error from 127.0.0.1:9345 while dialing 49.13.68.75:10250, code 502: 502 Bad Gateway","code":500
```
a
I have the exact same issue on one of my clusters. Still need to investigate this further though. If you find any more leads on the root cause, please share 😄
n
isn't something altering iptables rules?
a
iptables is disabled at my end. It seems some internal tunnel breaks without being noticed.
f
But it seems that the rke2 proxy deployed on the master nodes is what fails, right?
yep, no iptables on my end either
n
and is there any log from the proxy about why it is failing?
f
Can't find it. By proxy I mean the service that runs on port 9345.
n
but kube-proxy pod has no error?
but actually, 10250 is kubelet. no errors in /var/lib/rancher/rke2/agent/kubelet.log ?
f
We don't have kube-proxy, we're using cilium with ebpf
n
are the nodes L2 reachable, or are you using VXLANs?
f
l2
n
do you have direct routes?
c
This indicates that the websocket tunnel from the agent to the server is getting disconnected and cannot reconnect. Check the logs on the agent for messages about remotedialer/websocket
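A quick way to pull those out of the agent journal; a sketch, assuming the default rke2-agent unit name:
```bash
# Websocket/remotedialer tunnel activity from the last day on the agent
sudo journalctl -u rke2-agent.service --since "24 hours ago" \
  | grep -iE "remotedialer|websocket|connecting to proxy"
```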
a
In my case, it's a 10Gbps, very low-latency network. Canal for networking using host-gateway, so direct L2 connectivity. Even debug log level didn't bring me closer to finding an exact indication in the logs. Research indeed indicated that the websocket tunnel goes down, but I would expect some sort of health check on this tunnel to restart it in case it goes down (or at least more logging about this problem).
c
It does have a health check, and you should see errors from it trying to reconnect every 30 seconds or so.
Can you confirm that all of your servers have distinct IPs? Show the output of
`kubectl get node -o wide`
and
`kubectl get endpoints -n default kubernetes -o yaml`
There should be a unique endpoint address listed for the Kubernetes service for each of your servers.
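If it's easier to eyeball, something like this prints just the endpoint IPs (a hypothetical convenience one-liner over the same object as above):
```bash
# Each server should show up exactly once
kubectl get endpoints -n default kubernetes \
  -o jsonpath='{.subsets[*].addresses[*].ip}{"\n"}'
```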
a
I have checked and all my servers have unique IPs. (I would also expect a lot more issues if they weren't unique.) At this time, I don't have an installation where the tunnel is broken, but at the last occurrence it was broken for two weeks before it was noticed.
c
provide logs, when you can.
also, info on what version you’re on - since you’re replying to someone else’s thread
f
I think this issue is present on all of the versions, though
Will provide logs from kubelet
c
there have been many improvements to the agent tunnel and health checking over the last year. I am not aware of any issues in the latest releases.
not kubelet logs. rke2-agent logs from journald.
and rke2-server logs from all servers, for the same timeframe
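Something like this captures both sides for a matching window; a sketch, adjust the time range to the incident:
```bash
# On the affected worker
sudo journalctl -u rke2-agent.service --since "2 hours ago" > "rke2-agent-$(hostname -s).log"

# On each server, same window
sudo journalctl -u rke2-server.service --since "2 hours ago" > "rke2-server-$(hostname -s).log"
```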
f
```
k logs airflow-worker-3
Defaulted container "airflow-worker" out of: airflow-worker, dags-git-sync, log-cleanup, dags-git-clone (init), check-db (init), wait-for-db-migrations (init)
Error from server: Get "https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker": proxy error from 127.0.0.1:9345 while dialing 49.13.68.75:10250, code 502: 502 Bad Gateway
```
agent
```
root@worker04:~# journalctl -u rke2-agent.service -f
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:17:17Z" level=info msg="Server 23.88.46.2:6443@PREFERRED->HEALTHY from successful health check"
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:04Z" level=info msg="Server 128.140.92.225:6443@RECOVERING->PREFERRED from successful health check"
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 49.13.13.59:6443@FAILED->RECOVERING from successful health check"
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 128.140.92.225:6443@PREFERRED->FAILED from failed dial"
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 23.88.46.2:6443@HEALTHY->ACTIVE from successful dial"
Nov 02 22:19:00 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:59Z" level=info msg="Server 49.13.13.59:6443@RECOVERING->PREFERRED from successful health check"
Nov 02 22:19:00 worker04.blabla rke2[1834650]: time="2025-11-02T22:19:00Z" level=info msg="Server 128.140.92.225:6443@FAILED->RECOVERING from successful health check"
Nov 02 22:19:01 worker04.blabla rke2[1834650]: time="2025-11-02T22:19:01Z" level=info msg="Server 128.140.92.225:6443@RECOVERING->PREFERRED from successful health check"
Nov 02 22:20:00 worker04.blabla rke2[1834650]: time="2025-11-02T22:20:00Z" level=info msg="Server 49.13.13.59:6443@PREFERRED->HEALTHY from successful health check"
Nov 02 22:20:01 worker04.blabla rke2[1834650]: time="2025-11-02T22:20:01Z" level=info msg="Server 128.140.92.225:6443@PREFERRED->HEALTHY from successful health check"
```
server
```
Nov 04 09:07:54 master01.blabla rke2[788072]: time="2025-11-04T09:07:54Z" level=error msg="Sending HTTP/1.1 502 response to 127.0.0.1:57440: failed to find Session for client worker04.blabla"
```
c
which one of those IPs is master01, and what is the most recent health-check result for master01?
it might be more useful to search for master01's IP in the agent logs
f
```
journalctl -u rke2-agent.service -f | grep "49.13.13.59"
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 49.13.13.59:6443@FAILED->RECOVERING from successful health check"
Nov 02 22:19:00 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:59Z" level=info msg="Server 49.13.13.59:6443@RECOVERING->PREFERRED from successful health check"
Nov 02 22:20:00 worker04.blabla rke2[1834650]: time="2025-11-02T22:20:00Z" level=info msg="Server 49.13.13.59:6443@PREFERRED->HEALTHY from successful health check"
```
c
There will also be a bunch of “connecting to proxy” and “started tunnel to” messages. They will likely have the hostname (master01) instead of the IP, if you used a hostname as the server URL.
Something like this:
```
INFO[0000] Updated load balancer rke2-agent-load-balancer default server: 172.17.0.4:9345
INFO[0000] Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [] [default: 172.17.0.4:9345]
INFO[0000] Updated load balancer rke2-api-server-agent-load-balancer default server: 172.17.0.4:6443
INFO[0000] Running load balancer rke2-api-server-agent-load-balancer 127.0.0.1:6443 -> [] [default: 172.17.0.4:6443]
INFO[0010] Got apiserver addresses from supervisor: [172.17.0.4:6443]
INFO[0010] Server 172.17.0.4:6443@STANDBY*->UNCHECKED from add to load balancer rke2-api-server-agent-load-balancer
INFO[0010] Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [172.17.0.4:6443] [default: 172.17.0.4:6443]
INFO[0010] Server 172.17.0.4:9345@STANDBY*->UNCHECKED from add to load balancer rke2-agent-load-balancer
INFO[0010] Updated load balancer rke2-agent-load-balancer server addresses -> [172.17.0.4:9345] [default: 172.17.0.4:9345]
INFO[0010] Connecting to proxy                           url="wss://172.17.0.4:9345/v1-rke2/connect"
INFO[0010] Server 172.17.0.4:9345@UNCHECKED*->RECOVERING from successful dial
INFO[0010] Remotedialer connected to proxy               url="wss://172.17.0.4:9345/v1-rke2/connect"
INFO[0010] Server 172.17.0.4:6443@UNCHECKED*->RECOVERING from successful health check
INFO[0011] Server 172.17.0.4:9345@RECOVERING*->ACTIVE from successful health check
INFO[0011] Server 172.17.0.4:6443@RECOVERING*->ACTIVE from successful health check
```
I do note that agent logs are only showing errors for 6443 (apiserver), not 9345 (supervisor). Is there some problem with the apiserver pod on master01? Have you checked the apiserver pod logs?
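If it helps, a sketch for pulling the apiserver logs on master01; the pod name follows the usual component-nodename static-pod convention, so double-check it with `kubectl -n kube-system get pods`:
```bash
# Via the API
kubectl -n kube-system logs kube-apiserver-master01.blabla --since=1h

# Or directly on master01, bypassing the API entirely
sudo sh -c 'tail -n 200 /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log'
```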
f
hm, hostname doesn't show anything
```
root@worker04:~# journalctl -u rke2-agent.service | grep "master01"
root@worker04:~#
```
logs from apiserver on master01
```
E1104 08:04:34.228543       1 timeout.go:140] "Post-timeout activity" logger="UnhandledError" timeElapsed="152.977”s" method="GET" path="/api/v1/namespaces/airbyte/pods/replication-job-230486-attempt-0" result=null
I1104 08:09:49.334529       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
E1104 08:12:48.656379       1 wrap.go:53] "Timeout or abort while handling" logger="UnhandledError" method="GET" URI="/api/v1/namespaces/airbyte/pods/replication-job-230484-attempt-0" auditID="92cf135f-a399-4ad4-ab98-2b357b121f28"
E1104 08:12:48.656452       1 timeout.go:140] "Post-timeout activity" logger="UnhandledError" timeElapsed="5.48”s" method="GET" path="/api/v1/namespaces/airbyte/pods/replication-job-230484-attempt-0" result=null
I1104 08:19:49.334658       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
I1104 08:29:49.335030       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
I1104 08:39:49.336135       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
I1104 08:49:49.336535       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
I1104 08:59:49.336615       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
E1104 09:06:57.924497       1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &url.Error{Op:\"Get\", URL:\"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\", Err:(*errors.errorString)(0xc05cb532a0)}: Get \"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\": proxy error from 127.0.0.1:9345 while dialing 49.13.68.75:10250, code 502: 502 Bad Gateway" logger="UnhandledError"
E1104 09:07:16.288482       1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &url.Error{Op:\"Get\", URL:\"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\", Err:(*errors.errorString)(0xc04b4c2c80)}: Get \"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\": proxy error from 127.0.0.1:9345 while dialing 49.13.68.75:10250, code 502: 502 Bad Gateway" logger="UnhandledError"
E1104 09:07:54.410252       1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &url.Error{Op:\"Get\", URL:\"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\", Err:(*errors.errorString)(0xc054191f60)}: Get \"https://49.13.68.75:10250/containerLogs/airflow/airflow-worker-3/airflow-worker\": proxy error from 127.0.0.1:9345 while dialing 49.13.68.75:10250, code 502: 502 Bad Gateway" logger="UnhandledError"
I1104 09:09:49.336711       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
I1104 09:19:49.337428       1 cidrallocator.go:277] updated ClusterIP allocator for Service CIDR 10.43.0.0/16
```
The apiserver is functioning; I can perform kubectl operations on resources and whatnot. Just getting logs from this specific worker is the issue.
c
What was going on with the apiserver around when the agent marked it as failing a dial here:
```
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 128.140.92.225:6443@PREFERRED->FAILED from failed dial"
```
failed dial means it tried to connect, but was unable to do so. So it was down or unreachable for some reason.
Is this the only server in the cluster, or are there others?
f
there are 3 servers in the cluster
```
Nov 02 22:18:59 worker04.blabla rke2[1834650]: time="2025-11-02T22:18:58Z" level=info msg="Server 128.140.92.225:6443@PREFERRED->FAILED from failed dial"
```
there are no logs for this on the server node
at this specific time
c
on the server side you should see a message like this whenever an agent connects to the remotedialer proxy:
```
INFO[0137] Handling backend connection request [rke2-agent-001.example.com]
```
and then another message like this when it disconnects (unfortunately it doesn’t say who is disconnecting):
```
INFO[1034] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
```
There are no logs for that time period at all on the server side? Was there some other network issue going on that caused it to fail to connect?
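A sketch for narrowing the server journals down to tunnel-related lines for that window (the date range and worker name are just the ones from the logs above):
```bash
# Run on each server
sudo journalctl -u rke2-server.service --since "2025-11-02" --until "2025-11-03" \
  | grep -E "Handling backend connection|remotedialer|worker04"
```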
f
```
Nov 02 02:14:02 master02.blabla rke2[691565]: time="2025-11-02T02:14:02Z" level=info msg="error in remotedialer server [400]: read tcp 128.140.92.225:9345->128.140.14.189:3028: i/o timeout"
Nov 03 14:36:37 master02.blabla rke2[691565]: time="2025-11-03T14:36:37Z" level=info msg="error in remotedialer server [400]: read tcp 128.140.92.225:9345->162.55.100.234:3204: i/o timeout"
Nov 03 14:44:55 master02.blabla rke2[691565]: time="2025-11-03T14:44:55Z" level=info msg="error in remotedialer server [400]: read tcp 128.140.92.225:9345->162.55.100.234:32768: i/o timeout"

Nov 02 02:14:04 master03.blabla rke2[689639]: time="2025-11-02T02:14:04Z" level=info msg="error in remotedialer server [400]: read tcp 23.88.46.2:9345->128.140.14.189:59666: i/o timeout"
Nov 03 14:36:23 master03.blabla rke2[689639]: time="2025-11-03T14:36:23Z" level=info msg="error in remotedialer server [400]: read tcp 23.88.46.2:9345->162.55.100.234:11820: i/o timeout"
Nov 03 14:48:56 master03.blabla rke2[689639]: time="2025-11-03T14:48:56Z" level=info msg="error in remotedialer server [400]: read tcp 23.88.46.2:9345->162.55.100.234:2608: i/o timeout"

Nov 02 02:13:55 master01.blabla rke2[788072]: time="2025-11-02T02:13:55Z" level=info msg="error in remotedialer server [400]: read tcp 49.13.13.59:9345->128.140.14.189:36168: i/o timeout"
Nov 03 14:36:35 master01.blabla rke2[788072]: time="2025-11-03T14:36:35Z" level=info msg="error in remotedialer server [400]: read tcp 49.13.13.59:9345->162.55.100.234:57974: i/o timeout"
Nov 03 14:46:41 master01.blabla rke2[788072]: time="2025-11-03T14:46:41Z" level=info msg="error in remotedialer server [400]: read tcp 49.13.13.59:9345->162.55.100.234:28184: i/o timeout"
```
Regarding a network issue, there could have been one, but it should not have been long.
and as I understand it, those health checks keep retrying? So the intermittent timeout should be resolved, no?
c
the health checks do, yes. However it seems like the agent thinks it still has a connection up - since you did not see any logs on the agent side regarding redialing the websocket proxy
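One way to confirm a half-open tunnel is to compare socket state on both ends; a sketch, assuming 9345 is only used by the supervisor:
```bash
# On the worker: is there still an ESTABLISHED connection to a server's 9345?
sudo ss -tnp state established '( dport = :9345 )'

# On each server: does the matching connection from that worker's IP still exist?
sudo ss -tnp state established '( sport = :9345 )'
```
If the worker still shows the connection but no server does, the agent is holding a dead socket.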
f
So the server hits a timeout for the agent's connection, but the agent for some reason keeps the connection up
And that's where the `failed to find Session for client` error comes from
c
if you are running with `supervisor-metrics: true` on the server, you can check the load balancer health metrics on individual nodes:
```
kubectl get --server https://AGENT:9345 --raw /metrics | grep rke2_loadbalancer
```
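For next time, enabling it is just a config entry plus a restart; a sketch, assuming the default config path:
```bash
# On each server node
echo "supervisor-metrics: true" | sudo tee -a /etc/rancher/rke2/config.yaml
sudo systemctl restart rke2-server.service
```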
f
hm, unfortunately, currently I do not
c
I am not sure what would cause the websocket client to think it’s still connected when the server says it is not. The latest release of RKE2 v1.33 (v1.33.5) uses a newer version of the remotedialer library, where all this logic lives. There haven’t been a lot of code changes, mostly packaging stuff, but it might be worth seeing if you can reproduce on that version.
One of the changes is to bump gorilla/websocket which could I guess be meaningful. https://github.com/gorilla/websocket/compare/v1.5.1...v1.5.3
f
Gotcha, will try to update to v1.33.5
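In case it saves a lookup, a sketch of the pinned upgrade via the install script (double-check the exact release tag before running):
```bash
# Run as root on each node, servers first, then agents
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION="v1.33.5+rke2r1" sh -

# Then restart the matching unit to pick up the new binary
systemctl restart rke2-server.service   # on servers
systemctl restart rke2-agent.service    # on agents
```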
also, maybe this is not an rke2 issue but a server configuration one
I see that you have dealt with similar issue here: https://github.com/rancher/rke2/issues/7952
c
if calico is doing weird things with the network, that could definitely also cause problems
f
right, but we have cilium + ebpf :D
n
btw, you said you have an L2 network, but according to the IP addresses you provided, it seems that you have an L3 network, and then with cilium you must configure vxlan tunnels. Did you do so?
f
Oh, right, server reachability is provided by the datacenter where we rent our servers
```
Nov 04 13:10:55 worker06.blabla rke2[3162564]: time="2025-11-04T13:10:55Z" level=info msg="Tunnel authorizer set Kubelet Port 0.0.0.0:10250"
Nov 04 13:11:46 worker06.blabla rke2[3162564]: time="2025-11-04T13:11:46Z" level=info msg="Server 49.13.13.59:9345@PREFERRED->HEALTHY from successful health check"
Nov 04 13:11:46 worker06.blabla rke2[3162564]: time="2025-11-04T13:11:46Z" level=info msg="Server 128.140.92.225:9345@PREFERRED->HEALTHY from successful health check"
Nov 04 13:11:46 worker06.blabla rke2[3162564]: time="2025-11-04T13:11:46Z" level=info msg="Server 23.88.46.2:6443@PREFERRED->HEALTHY from successful health check"
Nov 04 13:11:46 worker06.blabla rke2[3162564]: time="2025-11-04T13:11:46Z" level=info msg="Server 128.140.92.225:6443@PREFERRED->HEALTHY from successful health check"
Nov 05 06:04:21 worker06.blabla rke2[3162564]: time="2025-11-05T06:04:21Z" level=error msg="Error writing ping" error="write tcp 91.107.197.193:10670->23.88.46.2:9345: i/o timeout"
Nov 05 06:04:21 worker06.blabla rke2[3162564]: time="2025-11-05T06:04:21Z" level=error msg="Error writing ping" error="write tcp 91.107.197.193:10670->23.88.46.2:9345: i/o timeout"
```
managed to replicate the issue on v1.34.1
c
Ok... so what is going on in your environment when this occurs? The I/O error is coming from the low-level Go network stack. The question would be what is causing it, and why the websocket client isn't recovering from it. If we can reproduce it in a dev environment then we can figure out which library needs to be fixed.
n
is the websocket connected all the time? there might be a short timeout for keepalive connections (if a redeploy helps for some time, that may be the case)
f
Currently, my idea is that an OOM that happens on the worker (in one of the containers) causes this connection loss
Will try to reproduce this
n
you should see the OOM in `dmesg` output
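For example, either of these should show the kernel OOM-killer events with timestamps (a sketch):
```bash
# Human-readable timestamps from the kernel ring buffer
sudo dmesg -T | grep -iE "out of memory|oom-killer|killed process"

# Or the same from the journal (current boot)
sudo journalctl -k | grep -iE "out of memory|oom-killer|killed process"
```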
f
yeah, I see that
that's why this idea came up
but I want to emphasise that this OOM does not happen in the rke2-agent service but in a different container
c
Do you have swap enabled? What is the CPU utilization on this node when this occurs? Is this a VM that might be getting throttled or paused under memory pressure?
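Quick checks for those on the worker; a sketch, the PSI files need a reasonably recent kernel:
```bash
swapon --show                                 # empty output means swap is off
free -h                                       # current memory headroom
cat /proc/pressure/memory /proc/pressure/cpu  # PSI stall stats (kernel 4.20+)
```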
f
workers run on bare metal; CPU utilization during the OOM is around 50%
swap is disabled
c
I believe there may be a fix available for this
the engineer working on the similar Rancher issue hasn’t been able to verify the fix yet, as the affected user hasn’t been able to validate it, but it looks good to me. I will try to get the version bump into this cycle. https://github.com/k3s-io/k3s/issues/13149
f
great, that's nice to hear
a
If the fix is this commit in remotedialer, shouldn't it then be fixed in rke2 v1.34.1+rke2r1 by this commit, which updated the dependency version? The deadlock does describe exactly what I notice on my end (running 1.32.4). However, Luke said above that he replicated the issue on v1.34.1.
f
Yeah, on v1.34.1 I did get the error above. But I still cannot replicate it; my first idea was that some other OOM causes this, but I forced an OOM on an application running on the worker and it didn't lose the connection.
a
I managed to replicate the issue with my client, with a high number of pods that stress CPU/network/memory to the limits. But updating to 1.34 is not so easy for us and is only planned for Q1 2026.
f
No test environment?
a
We have enough test clusters, but replicating the issue is a pain.
f
what CNI are you using?
a
I use Calico in host-gw mode. I think this is one of the simplest network setups you can use.
c
We don’t have any branches that include the locking fix. The bump we’re making is: https://github.com/rancher/remotedialer/compare/e6b68fd83a6b...f160aa32568d
a
Oh, I see now. The Go pseudo-version is of the commit just before the fix, so it's indeed not included yet. I see you created issues to have this backported; thanks for that.
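A way to double-check which remotedialer revision a given rke2 binary actually ships; a sketch, assuming the Go toolchain is available somewhere and the binary is at the default install path:
```bash
# Prints the module list embedded in the binary, including the remotedialer pseudo-version
go version -m /usr/local/bin/rke2 | grep remotedialer
```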