# k3s
a
b
What do you get when you run `kubectl get nodes -o wide`?
a
```
NAME       STATUS   ROLES    AGE    VERSION         INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
node3      Ready    <none>   271d   v1.19.16+k3s1   10.0.1.4      10.0.1.4      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.4.11-k3s1
node2      Ready    <none>   271d   v1.19.16+k3s1   10.0.1.3      10.0.1.3      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.4.11-k3s1
node1      Ready    <none>   271d   v1.19.16+k3s1   10.0.1.2      10.0.1.2      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.4.11-k3s1
control1   Ready    master   271d   v1.19.16+k3s1   10.0.0.6      10.0.0.6      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.4.11-k3s1
control2   Ready    master   271d   v1.19.16+k3s1   10.0.0.5      10.0.0.5      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.4.11-k3s1
control3   Ready    master   271d   v1.19.16+k3s1   10.0.0.7      10.0.0.7      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.4.11-k3s1
```
The systemd service is running with this command on the masters:
```
k3s server --tls-san gate.nellcorp.com \
--datastore-endpoint ${DB_URL} \
--flannel-backend=host-gw \
--token ${TOKEN} \
--advertise-address=${NODE_IP} \
--node-ip=${NODE_IP} \
--node-external-ip=${NODE_IP} \
--flannel-iface=ens10 \
--node-taint=k3s-controlplane=true:NoSchedule \
--private-registry=/home/ubuntu/.k3s/registries.yaml \
--kube-apiserver-arg=token-auth-file=${TOKEN_PATH} \
--kubelet-arg=cluster-dns=1.1.1.1 \
--kubelet-arg=cluster-domain=cluster.local
```
c
Why are you overriding the kubelet’s cluster-dns and cluster-domain settings?
Also, that is a very old and unsupported release of K3s. 1.22 is just about to go end-of-life; everything older than that has been unsupported for months, if not longer.
a
Will update in a few. I was only overriding those while diagnosing the issue, as I initially assumed it was a DNS issue; this was happening even without the override.
Cluster updated, still no cni0 on the masters:
```
NAME       STATUS   ROLES                  AGE    VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
control3   Ready    control-plane,master   272d   v1.24.4+k3s1   10.0.0.7      10.0.0.7      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.6.6-k3s1
node1      Ready    <none>                 272d   v1.24.4+k3s1   10.0.1.2      10.0.1.2      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.6.6-k3s1
node2      Ready    <none>                 272d   v1.24.4+k3s1   10.0.1.3      10.0.1.3      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.6.6-k3s1
node3      Ready    <none>                 272d   v1.24.4+k3s1   10.0.1.4      10.0.1.4      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.6.6-k3s1
control1   Ready    control-plane,master   272d   v1.24.4+k3s1   10.0.0.6      10.0.0.6      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.6.6-k3s1
control2   Ready    control-plane,master   272d   v1.24.4+k3s1   10.0.0.5      10.0.0.5      Ubuntu 20.04.3 LTS   5.4.0-126-generic   containerd://1.6.6-k3s1
```
I guess my question is, is this expected behavior? Should master nodes not have cni0?
c
no, they should have the same CNI bits as the agents.
Can you start the server with --debug and post the k3s-server logs from startup onwards?
I suspect something is going wrong with your flannel config, although it’s odd that the nodes would be coming Ready with a broken CNI
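A minimal sketch of capturing that, assuming K3s runs as the usual `k3s` systemd unit implied above:
```
# Add --debug to the existing k3s server flags in the unit file, then
# restart the service and collect its logs from startup onwards:
sudo systemctl daemon-reload
sudo systemctl restart k3s
sudo journalctl -u k3s --since "2 minutes ago" > k3s-server-debug.log
```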
a
@creamy-pencil-82913 I've removed logs from before enabling debug, let me know if this is enough. Thanks for helping!
Also, re: the flannel config, I'm not really configuring it other than setting the flannel-backend to host-gw and the flannel-iface to the interface on each node that connects to all the other nodes. I'm doing that because I don't want any traffic to leave these 2 subnets.
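For reference, a sketch of those two settings written as a K3s config file instead of CLI flags, assuming the same interface name ens10 as in the server command above (K3s also reads `/etc/rancher/k3s/config.yaml`):
```
# Equivalent of --flannel-backend=host-gw --flannel-iface=ens10,
# placed in the K3s config file on each server node:
sudo tee /etc/rancher/k3s/config.yaml <<'EOF'
flannel-backend: host-gw
flannel-iface: ens10
EOF
```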
Also, here is what I think could be the issue:
```
Sep 27 10:25:10 control1 k3s[8063]: time="2022-09-27T10:25:10Z" level=debug msg="Creating the CNI conf in directory /var/lib/rancher/k3s/agent/etc/cni/net.d"
Sep 27 10:25:10 control1 k3s[8063]: time="2022-09-27T10:25:10Z" level=debug msg="Creating the flannel configuration for backend host-gw in file /var/lib/rancher/k3s/agent/etc/flannel/net-conf.json"
Sep 27 10:25:10 control1 k3s[8063]: time="2022-09-27T10:25:10Z" level=debug msg="The flannel configuration is {\n\t\"Network\": \"10.42.0.0/16\",\n\t\"EnableIPv6\": false,\n\t\"EnableIPv4\": true,\n\t\"IPv6Network\": \"::/0\",\n\t\"Backend\": {\n\t\"Type\": \"host-gw\"\n}\n}\n"
```
So the flannel config is being set in /var/lib/rancher/k3s/agent, but isn't this directory only read by k3s in agent mode? As in, would the server not just ignore it?
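For what it's worth, the server runs an embedded agent unless it was started with --disable-agent, so those paths are used on the control-plane nodes too. A quick check, using the paths from the debug log above:
```
# On a control-plane node: inspect what the embedded agent wrote
sudo ls -l /var/lib/rancher/k3s/agent/etc/cni/net.d/
sudo cat /var/lib/rancher/k3s/agent/etc/flannel/net-conf.json
```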
And this is probably what's preventing the interface from being set up:
```
Sep 27 12:03:17 control1 k3s[12756]: time="2022-09-27T12:03:17Z" level=info msg="Running flannel backend."
Sep 27 12:03:17 control1 k3s[12756]: I0927 12:03:17.173846   12756 route_network.go:55] Watching for new subnet leases
Sep 27 12:03:17 control1 k3s[12756]: I0927 12:03:17.191596   12756 route_network.go:92] Subnet added: 10.42.4.0/24 via 10.0.1.3
Sep 27 12:03:17 control1 k3s[12756]: E0927 12:03:17.192455   12756 route_network.go:167] Error adding route to {Ifindex: 3 Dst: 10.42.4.0/24 Src: <nil> Gw: 10.0.1.3 Flags: [] Table: 0 Realm: 0}: network is unreachable
Sep 27 12:03:17 control1 k3s[12756]: I0927 12:03:17.192769   12756 route_network.go:92] Subnet added: 10.42.5.0/24 via 10.0.1.4
Sep 27 12:03:17 control1 k3s[12756]: E0927 12:03:17.193238   12756 route_network.go:167] Error adding route to {Ifindex: 3 Dst: 10.42.5.0/24 Src: <nil> Gw: 10.0.1.4 Flags: [] Table: 0 Realm: 0}: network is unreachable
Sep 27 12:03:17 control1 k3s[12756]: I0927 12:03:17.193490   12756 route_network.go:92] Subnet added: 10.42.3.0/24 via 10.0.1.2
Sep 27 12:03:17 control1 k3s[12756]: E0927 12:03:17.193935   12756 route_network.go:167] Error adding route to {Ifindex: 3 Dst: 10.42.3.0/24 Src: <nil> Gw: 10.0.1.2 Flags: [] Table: 0 Realm: 0}: network is unreachable
Sep 27 12:03:17 control1 k3s[12756]: I0927 12:03:17.194213   12756 route_network.go:92] Subnet added: 10.42.1.0/24 via 65.21.182.167
Sep 27 12:03:17 control1 k3s[12756]: E0927 12:03:17.194817   12756 route_network.go:167] Error adding route to {Ifindex: 3 Dst: 10.42.1.0/24 Src: <nil> Gw: 65.21.182.167 Flags: [] Table: 0 Realm: 0}: network is unreachable
Sep 27 12:03:17 control1 k3s[12756]: I0927 12:03:17.195117   12756 route_network.go:92] Subnet added: 10.42.2.0/24 via 65.108.53.74
Sep 27 12:03:17 control1 k3s[12756]: E0927 12:03:17.195838   12756 route_network.go:167] Error adding route to {Ifindex: 3 Dst: 10.42.2.0/24 Src: <nil> Gw: 65.108.53.74 Flags: [] Table: 0 Realm: 0}: network is unreachable
```
It can't find a route to the podCIDR via the worker nodes, which is strange, as I can ping any one of the nodes from the master.
I also suspect this is due to the masters being in a different subnet and thus needing a gateway to reach the workers. But it seems like the workers set up routes on their own.
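A sketch of how to check what the kernel on a master can actually reach on-link, using the interface name (ens10) and addresses from the output above:
```
# On control1: which path would be used to reach a worker's internal IP?
ip route get 10.0.1.3
# Routes bound to the flannel interface; host-gw needs the worker IPs on-link here
ip route show dev ens10
# Any pod-CIDR routes that flannel did manage to install
ip route show | grep 10.42
```
host-gw installs routes like `10.42.4.0/24 via 10.0.1.3`; the kernel rejects them with "network is unreachable" when 10.0.1.3 is not directly reachable on the same L2 segment, even if it is pingable through a gateway.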
b
If you use host-gw, all nodes (masters and workers) need to be in the same network, otherwise the routes cannot be created. That's the log you are getting.
Why did you choose host-gw instead of the default vxlan?
a
I wanted simplicity, and I had already set up an internal network across the nodes. This is on Hetzner, so I set up a vSwitch.
b
https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md#host-gw
> Use host-gw to create IP routes to subnets via remote machine IPs. Requires direct layer2 connectivity between hosts running flannel.
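A minimal illustration of that requirement, replaying by hand what flannel attempted in the log above (expected to fail from a master in 10.0.0.0/24):
```
# What the host-gw backend effectively attempts on control1. The gateway of a
# host-gw route must itself be on-link, so this fails across the subnet boundary
# with "Network is unreachable" (or "Nexthop has invalid gateway" on newer iproute2):
sudo ip route add 10.42.4.0/24 via 10.0.1.3
```
The default vxlan backend avoids this by encapsulating pod traffic in UDP between node IPs, so it only needs L3 reachability between nodes.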
a
Also, I believe I got it to work by deploying a pod on a master node. I had to add a toleration for k3s-controlplane: true to a test pod, and as soon as the pod was scheduled, cni0 suddenly appeared on the master node.
I think this is very unintuitive and also defeats the point of master nodes, no? I mean, if I don't schedule a pod on them, networking breaks?
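For reference, a sketch of that kind of test pod; the pod name and image are placeholders, and the toleration matches the custom taint from the server flags (k3s-controlplane=true:NoSchedule):
```
# Hypothetical test pod that tolerates the custom control-plane taint and is
# steered to a master via a nodeSelector on the hostname label:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cni-test            # placeholder name
spec:
  nodeSelector:
    kubernetes.io/hostname: control1
  tolerations:
  - key: k3s-controlplane
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: sleep
    image: busybox:1.36
    command: ["sleep", "3600"]
EOF
```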
b
Don't you have pods like coredns or traefik deployed on the masters?
a
No, those apparently also do not have tolerations for the controlplane taint.
b
If you choose your own taints for the node, I guess you should change the toleration of the pods, right?
I can see that the pods have tolerations for these taints:
```
- effect: NoSchedule
  key: node-role.kubernetes.io/control-plane
  operator: Exists
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Exists
```
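One way to confirm that on a live cluster, assuming the default packaged workloads in kube-system:
```
# Show the tolerations carried by the bundled CoreDNS and Traefik deployments
kubectl -n kube-system get deploy coredns -o jsonpath='{.spec.template.spec.tolerations}'; echo
kubectl -n kube-system get deploy traefik -o jsonpath='{.spec.template.spec.tolerations}'; echo
```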
a
Ah, interesting, I believe the taint was changed here: https://github.com/rancher/docs/issues/2707
RTFM, I guess.
--node-taint CriticalAddonsOnly=true:NoExecute
Will try this.
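That is, something like this for the server unit, keeping the other flags from the original command and dropping the diagnostic kubelet DNS overrides (a sketch, not verified on this cluster):
```
k3s server --tls-san gate.nellcorp.com \
  --datastore-endpoint ${DB_URL} \
  --flannel-backend=host-gw \
  --token ${TOKEN} \
  --advertise-address=${NODE_IP} \
  --node-ip=${NODE_IP} \
  --node-external-ip=${NODE_IP} \
  --flannel-iface=ens10 \
  --node-taint=CriticalAddonsOnly=true:NoExecute \
  --private-registry=/home/ubuntu/.k3s/registries.yaml \
  --kube-apiserver-arg=token-auth-file=${TOKEN_PATH}
```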
Working! I was using the wrong taint. Thanks a lot @creamy-pencil-82913 and @bland-account-99790!
🙌 1