# rke2
a
I've done this a few ways in the past, but usually I'll:
1. Add the new control plane node, so it's a 4 control plane cluster
2. `kubectl drain --ignore-daemonsets nodeIDHere`
3. See if things all look okay, then shut down the drained node
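A rough sketch of that sequence, assuming the replacement node has already been registered through Rancher and is Ready; `nodeIDHere` is a placeholder for the node being retired:

```sh
# 1. With the new control plane node joined and Ready, drain the old one
kubectl drain nodeIDHere --ignore-daemonsets --delete-emptydir-data

# 2. Sanity-check the cluster before doing anything destructive
kubectl get nodes
kubectl -n kube-system get pods -o wide | grep etcd

# 3. If everything looks healthy, stop RKE2 on the drained node and remove it
systemctl stop rke2-server        # run on the drained node
kubectl delete node nodeIDHere    # run from a working kubeconfig
```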
c
What exactly became unresponsive? As long as you have quorum, the etcd cluster should continue running, and the other control-plane nodes should keep functioning. Did you run the killall script to make sure everything was completely shut down on the node you stopped?
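For reference, a minimal sketch of what that check could look like, assuming the default RKE2 install paths; the node-local admin kubeconfig lets you bypass Rancher entirely:

```sh
# On the node being taken down: the killall script makes sure nothing is left running
/usr/local/bin/rke2-killall.sh

# On one of the surviving server nodes: talk to the apiserver directly,
# without going through Rancher, and check the remaining members
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes
/var/lib/rancher/rke2/bin/kubectl -n kube-system get pods -o wide | grep etcd
```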
c
I couldn't reach the cluster using kubectl anymore. Since my kubeconfig file points to the Rancher installation, I would guess Rancher is to blame here. I ran "systemctl stop rke2-server" on the control plane node before shutting it down.
I thought about adding a fourth control plane node, but I found the known issues about v2.8.2 in this context quite alarming. Especially the prospect of permanently removing the very first control plane/etcd node makes me pretty nervous.
c
Did your kubectl point at the node you shut down? Do you have an external LB in front of the apiserver, or are you just pointing at a specific server?
c
No, the kubeconfig file points to the Rancher cluster "rancher02.example.com". There is no LB or round-robin DNS or similar for the IP addresses of the control plane nodes of the managed cluster. There is no proxy, either. The Authorized Endpoint for the managed cluster is set to "Disabled" in Rancher. There are no TLS Alternate Names defined.
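For context, a Rancher-generated kubeconfig for a downstream cluster with the authorized cluster endpoint disabled typically contains only a single, Rancher-proxied server entry along these lines (the cluster ID below is a placeholder):

```yaml
apiVersion: v1
kind: Config
clusters:
- name: kube005
  cluster:
    # every request is proxied through the Rancher servers
    server: https://rancher02.example.com/k8s/clusters/c-m-xxxxxxxx
```

With the authorized cluster endpoint enabled, Rancher adds extra contexts that point straight at the downstream control plane nodes, which is what gives you a path to the cluster when the Rancher proxy or the cluster agent is down.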
b
Rancher talks to the downstream cluster via a deployment called the cattle-cluster-agent. I bet you took down the node it's running on and it's not starting on a new node again. Can you check the status of that pod?
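Since the Rancher-proxied kubeconfig is currently unusable, a sketch of how that could be checked from one of the surviving control plane nodes, using the node-local admin kubeconfig at the default RKE2 path:

```sh
# On a surviving control plane node (kubectl ships under /var/lib/rancher/rke2/bin)
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl -n cattle-system get pods -o wide
/var/lib/rancher/rke2/bin/kubectl -n cattle-system describe deployment cattle-cluster-agent
```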
c
There are supposed to be 2 cluster agents running, AFAICS. Shouldn't Rancher connect to the other one?
c
That's not how it works. They are active/standby, and maintain an outbound connection to the Rancher server. The standby will take over when the active one fails to renew its lease.
That is standard behavior for lease-locked HA controllers.
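If you want to see the mechanism, leader-election locks of this kind show up as Lease objects; a generic sketch (the exact namespace and lease name used by the cluster agent may differ):

```sh
# the HOLDER column shows which replica currently owns the lock
kubectl get leases -A | grep -i cattle
```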
c
@creamy-pencil-82913 I think there is a misunderstanding here, which is surely my fault; I should have described the whole setup right from the start. I am running Rancher 2.8.2 in a 3 node setup. The common host name mapped to the IP addresses of these 3 nodes is "rancher02". All my kubeconfig files use this host name on the cluster server line, especially the kubeconfig files for the managed clusters.

Using this Rancher installation I had set up a managed cluster kube005, starting with 1 control plane / etcd node ("kube005c00") and 1 worker node ("kube005w00"). Over time it was extended to 3 cp/etcd nodes and 7 worker nodes. The managed cluster does not provide an authorized endpoint.

The problem is: if I shut down kube005c00, I cannot reach the cluster via kubectl anymore. I had expected some kind of fail-over (within 30 seconds) to make use of the second or third cp/etcd node somehow. Since my kubeconfig file only points to rancher02, I think Rancher is to blame here. Is it?
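One way to confirm that and regain access while kube005c00 is down, sketched under the assumption of default RKE2 paths and that the apiserver certificates on the surviving nodes include their own hostnames (RKE2 normally adds them as SANs): copy the node-local admin kubeconfig off a surviving control plane node and point it at that node instead of localhost.

```sh
# On your workstation: fetch the local admin kubeconfig from kube005c01,
# a surviving control plane / etcd node (e.g. via scp)
scp root@kube005c01:/etc/rancher/rke2/rke2.yaml kube005-direct.yaml

# Edit the server line so it points at that node instead of 127.0.0.1:
#   server: https://kube005c01:6443

# kubectl now works without going through rancher02
kubectl --kubeconfig kube005-direct.yaml get nodes
```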