# rke2
a
This message was deleted.
c
This looks like networking problems. Why are you getting IO timeouts on an established TCP socket? What is on the network between these two nodes?
b
the network stack is node -> 100GbE switch -> hypervisor -> openvswitch -> vm (control plane)
but the path looks quite clean to me
c
if you do a packet capture on either side, what do you see? Where are the packets being dropped?
that’s just a websocket tunnel, so https… is it going through a firewall or something else that has an idle timeout?
b
nope, just rke2-server sitting on the raw network. no firewall, load balancers, etc
c
something’s dropping the connection. that io timeout is bubbled up from the OS sockets layer.
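
To illustrate that point, here is a minimal Python sketch (not from the thread, and not rke2 code) of how a timeout on an already-established TCP socket surfaces to the application. The "silent" server stands in for a peer whose packets are being lost somewhere on the path; the function names, the loopback address, and the 0.2 s timeout are all assumptions chosen for the demo.

```python
import socket
import threading


def silent_server(listener, hold_s=1.0):
    """Accept one connection, send nothing for hold_s seconds, then close."""
    conn, _ = listener.accept()
    threading.Event().wait(hold_s)  # simulate a peer that has gone quiet
    conn.close()


def read_with_timeout(host, port, timeout_s):
    """Connect, then try to read one byte; report what the socket layer says."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        try:
            sock.recv(1)
            return "data or clean close"
        except socket.timeout:
            # The connection was established fine; the read simply never
            # completed within timeout_s.
            return "timeout"


def demo():
    """Run the silent server on an ephemeral loopback port and read from it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=silent_server, args=(listener,), daemon=True)
    worker.start()
    result = read_with_timeout("127.0.0.1", port, 0.2)
    worker.join()
    listener.close()
    return result


if __name__ == "__main__":
    print(demo())
```

The application only learns that the read did not finish in time. It cannot tell which hop lost the traffic, which is why captures on both ends are needed.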
b
okay, i'll start digging there
thanks
c
I’d start pulling packet captures from either side and see where things disappear
b
weird that it wouldn't happen from other workers
it's so infrequent (from a network perspective) it's gonna be awful hard to comb through
so i'm clear on the stack here - this would be literally rke2 agent to rke2 server traffic? no pods involved, just the daemon on both nodes?
c
yep. process running as a systemd service on agent node, to similar process on server node.
b
alright, thanks, that simplifies things considerably 🙂
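
Because the drops are infrequent and only the two node-level processes are involved, one possible way to narrow the capture window is a long-running probe between the same two hosts. The sketch below is hypothetical and not part of rke2: it assumes you can run a small echo listener on the server node on a spare port, and it checks whether a plain TCP connection still works after sitting idle. All names and the idle durations are illustrative.

```python
import socket
import threading
import time


def serve_echo(listener):
    """Echo bytes back on a single connection until the peer closes it."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(data)


def recv_exact(sock, n):
    """Read exactly n bytes, or fewer if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            break
        buf += chunk
    return buf


def survives_idle(host, port, idle_s, timeout_s=2.0):
    """Return True if a connection still echoes after idling for idle_s seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(b"ping")
            if recv_exact(sock, 4) != b"ping":
                return False
            time.sleep(idle_s)  # leave the established connection untouched
            sock.sendall(b"ping")
            return recv_exact(sock, 4) == b"ping"
    except OSError:
        # Covers timeouts, resets, and refused connections alike.
        return False


def local_demo(idle_s=0.1):
    """Exercise the probe against an echo server on the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # ephemeral port for the demo
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=serve_echo, args=(listener,), daemon=True)
    worker.start()
    ok = survives_idle("127.0.0.1", port, idle_s)
    worker.join(timeout=2)
    listener.close()
    return ok


if __name__ == "__main__":
    print(local_demo())
```

In real use, `serve_echo` would run on the server node and `survives_idle` on the agent node in a loop with growing `idle_s` values, logging a timestamp on each failure so the matching packet captures are easier to find.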