# rke2
c
Hey guys, can someone confirm something basic for me about the architecture, in particular the rke2-server and/or rke2-agent start-up? I've read up on this but can't get an accurate grasp. The RKE2 config file (/etc/rancher/rke2/rke2.yaml) contains the "server" address on TCP 9345; currently, depending on the node, this points to different master nodes, mostly to the first one built, but some workers point to a different master. I'm assuming that once RKE2 is set up and the pod manifests and Helm charts are applied, the RKE2 configuration file is no longer referenced on reboot? I ask as I don't currently have a load balancer / fixed registration address in my lab, and as a test I shut down a master node that other nodes were pointing to, then rebooted them (leaving that master node off) and everything came back up no problem. Am I right in thinking that once RKE2 is built it no longer references the "server" part of the configuration file (or even the whole file)?
Not quite sure why the nodes come back OK with the server down; there is now a load balancer as a fixed registration address, though. rke2-server certainly does not load without the file on reboot. For some reason it appears rke2-agent (worker) will, even with the "server" entry pointing to something erroneous. This would normally point to the load balancer, so it's no longer an issue, just trying to work out why it loads fine without a valid entry... hmm.
b
rke2.yaml is not a config file, it's the kubeconfig used to access the cluster later. The actual config is config.yaml in the same directory, or additional YAMLs in the config.yaml.d directory. And yes, you need the config later. Always. You should continue to read the docs 😉
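For reference, a minimal joining-node config looks roughly like this (sketch; the address and token are placeholders):
```yaml
# /etc/rancher/rke2/config.yaml on a joining node -- minimal sketch, placeholder values
server: https://192.0.2.10:9345   # fixed registration address (or any existing server)
token: <cluster-join-token>
```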
c
It's true, I made a mistake there on my file paths. For my Rancher-on-Harvester deployed RKE2 the config file is here: /etc/rancher/rke2/config.yaml.d/50-rancher.yaml. For my manual RKE2 it's here: /etc/rancher/rke2/config.yaml. I'll do some more testing; if I remove the "server" line, or set it to something erroneous, RKE2 still works on reboot. Hence the original question, it appears not to be dependent on it.
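This is roughly what I'm doing to break it on purpose on the manually built worker (sketch; 192.0.2.1 is just a placeholder address nothing is listening on):
```bash
# point "server" at a dead address, then bounce the agent (same result after a full reboot)
sudo sed -i 's|^server:.*|server: https://192.0.2.1:9345|' /etc/rancher/rke2/config.yaml
sudo systemctl restart rke2-agent   # worker still comes up fine
```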
b
The server line is (at least) needed for a node to join the cluster. I bet it's still used later for other things, but I don't know the details. You should keep it.
c
Sure, it's supposed to point to the fixed registration address, which in my case is an HAProxy VIP. I'm breaking it intentionally to work things out more precisely; if you don't know, that's no problem. Will keep digging.
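The HAProxy side is nothing special, roughly this kind of TCP passthrough on the VIP (sketch; 9345 is the supervisor/join port, and the backend IPs are placeholders for the control plane nodes):
```
frontend rke2_supervisor
    bind *:9345
    mode tcp
    default_backend rke2_supervisor_servers

backend rke2_supervisor_servers
    mode tcp
    balance roundrobin
    server cp1 192.0.2.11:9345 check
    server cp2 192.0.2.12:9345 check
    server cp3 192.0.2.13:9345 check
```
Plus an equivalent frontend/backend pair on 6443 for the apiserver.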
b
It's OK to point it to an external load balancer, as long as the load balancer can reach any node. Check e.g. Harvester, they're doing it the same way, but the load balancer is "just" a kube-vip.
c
Yeah, working the Harvester one out; done a custom build with Calico and BGP with route reflectors to ToR, it's very nice indeed. The control plane VIP from the worker node makes sense from the agent side, I can see the setup in these files: /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json and /var/lib/rancher/rke2/agent/etc/rke2-api-server-agent-load-balancer.json (sketched after the node listing below). Just trying to piece together its inner workings from the master nodes. kube-vip is there but not in BGP or ARP mode; it's only there for the Harvester cloud provider to provision services on the Harvester load balancer. My Harvester load balancer has nothing provisioned, and the RKE2 conf files all point to one node on 10.254.32.102.
```
NAME                                STATUS   ROLES                       AGE     VERSION          INTERNAL-IP     EXTERNAL-IP   OS-IMAGE               KERNEL-VERSION     CONTAINER-RUNTIME
rke2-test-system-pool-k6zmb-2mfxs   Ready    control-plane,etcd,master   2d15h   v1.32.5+rke2r1   10.254.32.106   <none>        SUSE Linux Micro 6.1   6.4.0-19-default   <containerd://2.0.5-k3s1>
rke2-test-system-pool-k6zmb-xv6ns   Ready    control-plane,etcd,master   3d15h   v1.32.5+rke2r1   10.254.32.102   <none>        SUSE Linux Micro 6.1   6.4.0-19-default   <containerd://2.0.5-k3s1>
rke2-test-system-pool-k6zmb-zglrg   Ready    control-plane,etcd,master   3d      v1.32.5+rke2r1   10.254.32.104   <none>        SUSE Linux Micro 6.1   6.4.0-19-default   <containerd://2.0.5-k3s1>
rke2-test-worker-pool-58fmt-hzt7v   Ready    worker                      2d15h   v1.32.5+rke2r1   10.254.32.105   <none>        SUSE Linux Micro 6.1   6.4.0-19-default   <containerd://2.0.5-k3s1>
rke2-test-worker-pool-58fmt-mv52b   Ready    worker                      3d18h   v1.32.5+rke2r1   10.254.32.100   <none>        SUSE Linux Micro 6.1   6.4.0-19-default   <containerd://2.0.5-k3s1>
```
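Looking inside those agent load balancer files, they seem to explain why a worker survives a bad "server" entry: the agent keeps a cached list of all control plane endpoints, not just the one in the config. Roughly what mine contains (sketch from memory, field names may differ by version):
```
# /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json (approximate contents)
{
  "ServerURL": "https://10.254.32.102:9345",
  "ServerAddresses": [
    "10.254.32.102:9345",
    "10.254.32.104:9345",
    "10.254.32.106:9345"
  ]
}
```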
b
Was this cluster provisioned by Rancher? I think this is OK as it is. There is always a "first node" which all the other nodes are pointing to. If you configure IP pools, then kube-vip would also configure a VIP in ARP mode.
c
Yeah, it was configured via Rancher using the Harvester cloud provisioner.
Trying to get my head around how the master nodes continue to talk when the "server" entry in the RKE2 config file is dead. IP pools are part of Calico; I've done those and am routing the pod CIDR, cluster CIDR and Services of type LoadBalancer. That's all sorted. It's purely the failover mechanism of the rke2-server roles and how they function that is not clear yet.
might be best having this chat in #C03JQGGMWA0 to be fair
b
I meant the IP pools of Harvester. And no, there is no role change. Any control plane node can be the destination of the "server" line. They write things down into their configs, etcd... The procedure for replacing the first master is exactly that: change the config of the new master to point to another master. Period.
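Roughly, as a sketch (placeholder address; the very first server normally has no "server" line at all):
```yaml
# /etc/rancher/rke2/config.yaml on the repointed master -- sketch, placeholder address
server: https://192.0.2.12:9345   # any other live master, or the fixed registration address
```
Then restart rke2-server on that node.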
c
If the master that the other two are pointing to dies, and then auto-updates occur and reboot the remaining two... then it's broken, as the rke2-server service will fail. Seems a strange way to build them; if this were manually built RKE2 they would all point to a VIP for TCP 9345.
b
Your monitoring should have told you before this occurred 🙂
c
That's no use on weekends or holidays. I don't think I like how it's built; I'd rather have it point to a VIP that automates it.
Discussing this in #C03JQGGMWA0 as well; others are experiencing issues.