# rke2
a
I've done this a few ways in the past, but usually I'll:
1. Add the new control plane node, so it's a 4 control plane cluster
2. `kubectl drain --ignore-daemonsets nodeIDHere`
3. See if things all look okay, then shut down the drained node
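A rough sketch of that sequence, assuming the replacement node has already been registered through Rancher and is Ready; `nodeIDHere` is a placeholder for the node being retired:

```sh
# 1. With the new control plane node joined and Ready, drain the old one
kubectl drain nodeIDHere --ignore-daemonsets --delete-emptydir-data

# 2. Sanity-check the cluster before doing anything destructive
kubectl get nodes
kubectl -n kube-system get pods -o wide | grep etcd

# 3. If everything looks healthy, stop RKE2 on the drained node and remove it
systemctl stop rke2-server        # run on the drained node
kubectl delete node nodeIDHere    # run from a working kubeconfig
```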
c
What exactly became unresponsive? As long as you have quorum, the etcd cluster should continue running, and the other control-plane nodes should keep functioning. Did you run the killall script to make sure everything was completely shut down on the node you stopped?
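For reference, a minimal sketch of what that check could look like, assuming the default RKE2 install paths; the node-local admin kubeconfig lets you bypass Rancher entirely:

```sh
# On the node being taken down: the killall script makes sure nothing is left running
/usr/local/bin/rke2-killall.sh

# On one of the surviving server nodes: talk to the apiserver directly,
# without going through Rancher, and check the remaining members
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes
/var/lib/rancher/rke2/bin/kubectl -n kube-system get pods -o wide | grep etcd
```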
c
I couldn't reach the cluster using kubectl anymore. Since my kubeconfig file points to the Rancher installation, I would guess Rancher is to blame here. I ran "systemctl stop rke2-server" on the control plane node before shutting it down.
I thought about adding a fourth control plane node, but I found the known issues about v2.8.2 in this context quite alarming. Especially the prospect of permanently removing the very first control plane/etcd node makes me pretty nervous.
c
Did your kubectl point at the node you shut down? Do you have an external LB in front of the apiserver, or are you just pointing at a specific server?
c
No, the kubeconfig file points to the Rancher cluster "rancher02.example.com". There is no LB or round-robin DNS or similar for the IP addresses of the control plane nodes of the managed cluster. There is no proxy, either. The Authorized Endpoint for the managed cluster is set to "Disabled" in Rancher. There are no TLS Alternate Names defined.
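For context, a Rancher-generated kubeconfig for a downstream cluster with the authorized cluster endpoint disabled typically contains only a single, Rancher-proxied server entry along these lines (the cluster ID below is a placeholder):

```yaml
apiVersion: v1
kind: Config
clusters:
- name: kube005
  cluster:
    # every request is proxied through the Rancher servers
    server: https://rancher02.example.com/k8s/clusters/c-m-xxxxxxxx
```

With the authorized cluster endpoint enabled, Rancher adds extra contexts that point straight at the downstream control plane nodes, which is what gives you a path to the cluster when the Rancher proxy or the cluster agent is down.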
b
Rancher talks to the downstream cluster via a deployment called the cattle-cluster-agent. I bet you took down the node it's running on and it's not starting on a new node again. Can you check the status of that pod?
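Since the Rancher-proxied kubeconfig is currently unusable, a sketch of how that could be checked from one of the surviving control plane nodes, using the node-local admin kubeconfig at the default RKE2 path:

```sh
# On a surviving control plane node (kubectl ships under /var/lib/rancher/rke2/bin)
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl -n cattle-system get pods -o wide
/var/lib/rancher/rke2/bin/kubectl -n cattle-system describe deployment cattle-cluster-agent
```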
c
There are supposed to be 2 cluster agents running, AFAICS. Shouldn't Rancher connect to the other one?
c
That's not how it works. They are active/standby, and maintain an outbound connection to the Rancher server. The standby will take over when the active one fails to renew its lease.
That is standard behavior for lease-locked HA controllers.
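If you want to see the mechanism, leader-election locks of this kind show up as Lease objects; a generic sketch (the exact namespace and lease name used by the cluster agent may differ):

```sh
# the HOLDER column shows which replica currently owns the lock
kubectl get leases -A | grep -i cattle
```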
c
@creamy-pencil-82913 I think there is a misunderstanding here, which is surely my fault; I should have described the whole setup right from the start. I am running Rancher 2.8.2 in a 3 node setup. The common host name mapped to the IP addresses of these 3 nodes is "rancher02". All my kubeconfig files use this host name on the cluster server line, especially the kubeconfig files for the managed clusters.

Using this Rancher installation I had set up a managed cluster kube005, starting with 1 control plane / etcd node ("kube005c00") and 1 worker node ("kube005w00"). Over time it was extended to 3 cp/etcd nodes and 7 worker nodes. The managed cluster does not provide an authorized endpoint.

The problem is: if I shut down kube005c00, I cannot reach the cluster via kubectl anymore. I had expected some kind of fail-over (within 30 seconds) to make use of the second or third cp/etcd node somehow. Since my kubeconfig file only points to rancher02, I think Rancher is to blame here. Is it?
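One way to confirm that and regain access while kube005c00 is down, sketched under the assumption of default RKE2 paths and that the apiserver certificates on the surviving nodes include their own hostnames (RKE2 normally adds them as SANs): copy the node-local admin kubeconfig off a surviving control plane node and point it at that node instead of localhost.

```sh
# On your workstation: fetch the local admin kubeconfig from kube005c01,
# a surviving control plane / etcd node (e.g. via scp)
scp root@kube005c01:/etc/rancher/rke2/rke2.yaml kube005-direct.yaml

# Edit the server line so it points at that node instead of 127.0.0.1:
#   server: https://kube005c01:6443

# kubectl now works without going through rancher02
kubectl --kubeconfig kube005-direct.yaml get nodes
```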