# k3s
a
Hi folks, I am wondering if someone can help me with the following: we are setting up a cluster with the k3s control plane running via containers on a separate Kubernetes cluster. In front of this we have a Kubernetes LB service which points a static IP at three instances of the control plane. The control plane uses an external DB hosted on the same cluster. We then have two bare-metal nodes running on a different provider, connecting to the control plane as agent-only nodes. These have joined the cluster successfully and are showing as Ready in the Kubernetes node list.

We seem to be having issues with any Kubernetes API -> cluster connectivity (logs, exec, port-forward, etc.). When running any of those commands against the cluster, I see the following error in the server pods with `--debug` turned on:
```
Tunnel server egress proxy dial error: failed to find Session for client <node>
```
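(For reference, these are the kinds of commands that hit the error; the pod and namespace names below are just placeholders.)

```sh
# Any apiserver -> kubelet operation fails, for example:
kubectl logs some-pod -n some-namespace
kubectl exec -it some-pod -n some-namespace -- sh
kubectl port-forward some-pod 8080:80 -n some-namespace
```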
During the startup of the agent service on both nodes I can see logs like so:
```
time="2025-02-20T16:44:33Z" level=info msg="Starting k3s agent v1.31.5+k3s1 (56ec5dd4)"
time="2025-02-20T16:44:33Z" level=info msg="Updated load balancer k3s-agent-load-balancer default server: 34.x.x.x:6443"
time="2025-02-20T16:44:33Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 10.16.43.69:6443"
time="2025-02-20T16:44:33Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 10.16.20.47:6443"
time="2025-02-20T16:44:33Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 10.16.32.61:6443"
time="2025-02-20T16:44:33Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [10.16.43.69:6443 10.16.20.47:6443 10.16.32.61:6443] [default: 34.x.x.x:6443]"
time="2025-02-20T16:44:33Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [10.16.43.69:6443 10.16.20.47:6443 10.16.32.61:6443] [default: 34.x.x.x:6443]"
time="2025-02-20T16:44:43Z" level=info msg="Server 10.16.43.69:6443@UNCHECKED->FAILED from failed dial"
time="2025-02-20T16:44:53Z" level=info msg="Server 10.16.20.47:6443@UNCHECKED->FAILED from failed dial"
time="2025-02-20T16:44:53Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
time="2025-02-20T16:45:03Z" level=info msg="Server 10.16.32.61:6443@UNCHECKED->FAILED from failed dial"
time="2025-02-20T16:45:08Z" level=info msg="Module overlay was already loaded"
time="2025-02-20T16:45:08Z" level=info msg="Module nf_conntrack was already loaded"
time="2025-02-20T16:45:08Z" level=info msg="Module br_netfilter was already loaded"
time="2025-02-20T16:45:08Z" level=info msg="Module iptable_nat was already loaded"
time="2025-02-20T16:45:08Z" level=info msg="Module iptable_filter was already loaded"
```
The addresses `10.16.43.69:6443`, `10.16.20.47:6443`, and `10.16.32.61:6443` seem to be the control plane pod IPs, and the `34.x.x.x:6443` address is the LB exposing the control plane. I found a similar issue: https://github.com/k3s-io/k3s/issues/6698, but I don't think the resolution/issue is entirely the same in this case.
c
Agents expect to be able to connect directly to the server IPs. If the server nodes are actually pods running somewhere, with inaccessible addresses, that will not work.
a
This works because the server nodes are in the same subnet as the agents?
c
Right, but even if you do that, the agents still need to be able to connect directly to the servers. The external LB is just used to provide a fixed registration address that the agents initially use to find a server. Once they are connected, they switch over to connecting directly to the servers.
All cluster members need to be able to connect directly to each other.
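To illustrate, a minimal sketch of what that looks like from the agent side; the LB address matches your setup above and the token is a placeholder:

```sh
# The --server value is only the fixed registration address (the LB).
# After registering, the agent learns the real server addresses and dials them directly.
k3s agent \
  --server https://34.x.x.x:6443 \
  --token <cluster-token>
```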
a
Okay, that makes sense. So the best option here is to establish L3 connectivity in some way, via WireGuard or some other solution?
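Something roughly like this on each bare-metal node, purely as a sketch; the keys, the WireGuard endpoint, and the routed range are placeholders and assume the control-plane side can terminate the tunnel somewhere:

```sh
# Hypothetical sketch only: route the control-plane pod range (10.16.0.0/16 here,
# matching the pod IPs in the agent logs) over a WireGuard link from the agent node.
ip link add wg0 type wireguard
wg set wg0 private-key /etc/wireguard/agent.key \
  peer <control-plane-side-public-key> \
  endpoint <control-plane-side-ip>:51820 \
  allowed-ips 10.16.0.0/16
ip addr add 192.168.100.2/24 dev wg0
ip link set wg0 up
```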
c
I don't know that you can run WireGuard in a pod, I've not tried it.
a
This is probably not the place to ask, but have you used k0s for a setup like this before? They seem to advertise supporting a setup along these lines, but I was wondering if I would end up in the same situation with that.
c
No, I have never used k0s.
What you're doing here might work better if you looked at it as a virtual control plane, and ran the servers with `--disable-agent` so that they are not full members of the cluster. This isn't a supported configuration, but neither is running in an environment that lacks full connectivity between all nodes.
a
Unfortunately they do already have `--disable-agent`. These are the flags I am passing:

```yaml
- server
- --disable-agent
- --disable=coredns,servicelb,traefik
- --tls-san={{tlsSan}}
- --flannel-backend=none
- --egress-selector-mode=cluster
```
c
Yeah, so do you know how `kubectl logs` and `kubectl exec` work? And what the egress-selector is doing?
a
Correct me if I am wrong, but essentially the k8s control plane establishes a websocket(?) connection to the kubelet on the node, which executes the action. As for the egress selector, I am not sure.
c
Right, so when you run one of those commands, the apiserver makes a connection to the kubelet to pull logs, or to run the command in the pod and pipe output back to the client. This means that the server MUST be able to open a connection to the kubelet. I am guessing that your server pods can't connect to the agents? K3s includes an embedded egress proxy so that the apiserver can connect to kubelets over a websocket tunnel connection initiated by the agent. This means that you only need agent -> server connectivity, not server -> agent. However, this does mean that the agents need to be able to connect directly to all of the servers, so that every server has a websocket tunnel to use when it needs to talk to that kubelet. The problem you have in your environment is that servers can't connect to agents, and agents can't connect to servers. All anything can connect to is the LB you've put in front of the servers.
Basically, you need to rearchitect your design so that you have at least SOME sort of functional connectivity between agents and servers. You cannot rely on the reverse proxy in front of the servers handling everything.
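A quick way to confirm this from one of the bare-metal nodes is to try the same supervisor endpoint the agent itself failed to reach, against each server pod IP from your logs:

```sh
# Each control-plane address must be directly reachable from the agent node.
# /cacerts is the same endpoint the agent's local load balancer was trying to proxy to.
for s in 10.16.43.69 10.16.20.47 10.16.32.61; do
  curl -sk --connect-timeout 5 -o /dev/null -w "$s -> HTTP %{http_code}\n" "https://$s:6443/cacerts" \
    || echo "$s -> unreachable"
done
```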
a
Thank you for the detailed explanation and help, looks like I have some work to do 🙂