# rancher-setup
s
We were also troubleshooting the following files:
• /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json
• /var/lib/rancher/rke2/agent/etc/rke2-api-server-agent-load-balancer.json
What we found was that when a master node goes down, its IP is removed from the ServerAddresses list. However, for some worker nodes, the unhealthy master IP is not being removed. Additionally, we observed that after a couple of minutes, the unhealthy master IP is being re-added to the list, which is causing the cluster to go down.
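In case it helps anyone reproduce this: we were just watching the ServerAddresses field in those files while taking a master down. Rough sketch of how (assumes jq is installed on the node; adjust paths and interval to taste):

```sh
# print the address list the agent load balancer is currently using
jq '.ServerAddresses' /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json
jq '.ServerAddresses' /var/lib/rancher/rke2/agent/etc/rke2-api-server-agent-load-balancer.json

# keep watching while a master is shut down, to catch the unhealthy IP being re-added
watch -n 10 "jq '.ServerAddresses' /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json"
```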
Seems like it was a bug in rke2 1.28.0 (https://github.com/rancher/rke2/issues/5949); after upgrading, it is working as expected.
c
hey, I am also in a similar boat. My RKE2 cluster was deployed to Harvester via Rancher. The mentioned json files on the workers appear good, i.e. they list all three rke2 server nodes. What I'm trying to get my head around is why the second and third rke2-server nodes (master, cp, etcd) both point directly at the first node in /etc/rancher/rke2/config.yaml.d/50-rancher.yaml. There is no VIP for port 9345 that I can see; there is kube-vip in ARP mode for 6443. If I was building RKE2 myself I would craft the config files to all point at a centralised, load-balanced VIP for 6443 and 9345. If I shut down a master node I get similar results to you @stocky-smartphone-18501
FYI I'm running v1.32.5+rke2r1, built by the latest Rancher, hosted on the latest Harvester.
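For reference, this is the sort of thing I mean by crafting the config myself - every server/agent pointing at one shared, load-balanced VIP for 9345 rather than at node 1. Just a sketch: the address, token placeholder and DNS name are made up, and this is not what Rancher generates:

```yaml
# /etc/rancher/rke2/config.yaml on the 2nd/3rd servers and on agents (hand-rolled sketch)
server: https://10.0.0.100:9345    # shared VIP / load balancer in front of all three servers (placeholder IP)
token: <cluster-join-token>
tls-san:
  - 10.0.0.100                     # make sure the serving certs cover the VIP
  - rke2-api.example.internal      # hypothetical DNS name for the same VIP
```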
d
We are experiencing a similar issue in a Harvester setup. Don't know the rke2 version, but we just upgraded Harvester to 1.5.0. We tracked it down to the first node we installed, the one the others joined the cluster through. When the first node goes down, the VIP is also down. Even though we can see that the kube-vip pod did move to another node, the VIP stays unavailable.
c
is your kube-vip just for kubeapi on 6443, what about rke2-server on 9345?
i only have a vip for kubeapi
this is all i have with kube-vip:
- name: cp_enable value: "false"
- name: enable_service_security value: "true"
- name: lb_enable value: "true"
- name: lb_port value: "6443"
- name: svc_election value: "false"
- name: svc_enable value: "true"
- name: vip_arp value: "true"
- name: vip_cidr value: "32"
- name: vip_interface
- name: vip_leaderelection value: "false"
for rke2-server it points directly to nodes... which can't be right
If a master that the other two are pointing to dies, and then auto-updates kick in and reboot the remaining two... then it's broken, as the rke2-server service will fail. Seems a strange way to build them. Am I going mad, or is this just going to fail easily if you don't sit on it and respond to failed nodes within 24hrs?
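If anyone wants to compare notes, a quick way to see which registration address each node is actually using (paths are the ones mentioned above; run on each node):

```sh
# which server: address did Rancher write for this node?
grep -H "server:" /etc/rancher/rke2/config.yaml /etc/rancher/rke2/config.yaml.d/*.yaml 2>/dev/null

# is the rke2 service still happy?
systemctl is-active rke2-server 2>/dev/null || systemctl is-active rke2-agent
```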
default auto-update is daily on SUSE SLE Micro I think
For the kubeapi, I advertise the service to the underlay using calico, which works around API access issues. I'm more concerned with how the rke2 nodes register on TCP 9345; those aren't behind a VIP on mine when built with Rancher using the Harvester provider. As described above, it's not great for resiliency when things reboot and you have nodes pointing directly at other nodes rather than going via a load balancer. Dunno if it's always like that or if it's just my setup? Would be good to hear others' experience.
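In case the calico bit is useful to anyone, it's roughly this shape - a BGPConfiguration that advertises the service CIDR to the underlay peers you already have set up. The CIDR below is rke2's default so check your own cluster; BGPPeer setup is not shown, and I apply it with calicoctl:

```yaml
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  serviceClusterIPs:
    - cidr: 10.43.0.0/16   # rke2's default service CIDR; replace with yours
```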