# k3s
c
Hello all, I am running an HA k3s cluster with 3 control-plane nodes and I am setting up a rolling upgrade script to roll out one control-plane node at a time. Here is what I do for each of the 3 nodes (a rough sketch follows the list):
• create a new node, then drain the original node
• delete the node
• terminate the EC2 instance
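For reference, a minimal sketch of those per-node steps, assuming kubectl and the AWS CLI are configured; the node name and instance ID are placeholders, not values from the thread, and creating the replacement node (ASG, Terraform, etc.) is not shown:

# hedged sketch of the per-node rotation (placeholders, not the actual script)
OLD_NODE="<old-node-name>"            # placeholder
OLD_INSTANCE_ID="<old-instance-id>"   # placeholder

# drain the original node
kubectl drain "$OLD_NODE" --ignore-daemonsets --delete-emptydir-data

# delete the Kubernetes node object (k3s is then expected to remove its etcd member)
kubectl delete node "$OLD_NODE"

# terminate the EC2 instance
aws ec2 terminate-instances --instance-ids "$OLD_INSTANCE_ID"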
Most of the time it works, but sometimes the new node does not register properly and I need to do the following to remove the old node:
+----------------------------+--------+--------------+---------------------------+
|          ENDPOINT          | HEALTH |     TOOK     |           ERROR           |
+----------------------------+--------+--------------+---------------------------+
| https://10.46.232.150:2379 |   true |  10.703449ms |                           |
|  https://10.46.233.60:2379 |   true |  13.320709ms |                           |
|  https://10.46.232.22:2379 |   true |  30.889352ms |                           |
|  https://10.46.233.42:2379 |  false | 5.001702693s | context deadline exceeded |
+----------------------------+--------+--------------+---------------------------+
+------------------+---------+------------------------------------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |                         NAME                         |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+------------------------------------------------------+----------------------------+----------------------------+------------+
|   e85825167aeb74 | started |  ip-10-46-232-22.eu-west-1.compute.internal-850ae29a |  https://10.46.232.22:2380 |  https://10.46.232.22:2379 |      false |
|  b08c569b39de238 | started |  ip-10-46-233-60.eu-west-1.compute.internal-6e3a5329 |  https://10.46.233.60:2380 |  https://10.46.233.60:2379 |      false |
| cd8c7e2146c24d33 | started |  ip-10-46-233-42.eu-west-1.compute.internal-368aaa88 |  https://10.46.233.42:2380 |  https://10.46.233.42:2379 |      false |
| e0f0b05b7170b845 | started | ip-10-46-232-150.eu-west-1.compute.internal-0ec17b5a | https://10.46.232.150:2380 | https://10.46.232.150:2379 |      false |
+------------------+---------+------------------------------------------------------+----------------------------+----------------------------+------------+
=> remove the failing node:
etcdctl member remove cd8c7e2146c24d33
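For context, with k3s's embedded etcd the tables above are the output of etcdctl endpoint health and etcdctl member list. etcdctl is not bundled with k3s, so the sketch below assumes it is installed on a server node and uses the default k3s embedded-etcd certificate paths; adjust if yours differ:

# run on one of the k3s server (etcd) nodes; default k3s embedded-etcd cert locations
export ETCDCTL_API=3
export ETCDCTL_CACERT=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
export ETCDCTL_CERT=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt
export ETCDCTL_KEY=/var/lib/rancher/k3s/server/tls/etcd/server-client.key
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

etcdctl endpoint health --cluster -w table
etcdctl member list -w table

# remove the stale member by ID, as above
etcdctl member remove cd8c7e2146c24d33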
I am using v1.32.1+k3s1. Is this expected? I would like not to have to clean this up manually 😞
Is it because I terminate the old node too quickly after deleting it from the cluster?
As soon as I delete the failing etcd node, the new node appears
c
It sounds like you're deleting the old node before the new node has finished joining. I would probably suggest waiting for the new node to show as Ready in kubectl get node before deleting the old one. Can't say for sure without logs though.
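For example, that wait could be scripted roughly like this; the node name is a placeholder:

# block until the replacement control-plane node reports Ready (name is a placeholder)
NEW_NODE="<new-node-name>"
kubectl wait --for=condition=Ready "node/$NEW_NODE" --timeout=10m
kubectl get node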
c
I am creating a new node, waiting for it to be ready, then draining the old node, then deleting the old node, then terminating the old node. Sometimes the old node is kept in etcd. Then the new "control-plane" node cannot be added until I delete the entry manually
c
When you delete the old node, there will be logs of k3s trying to remove it from etcd. Find out why that is failing.
Check logs on all nodes, including the one you are deleting
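On a default systemd-based install the server logs live in the k3s unit, so checking each etcd node could look like the following sketch (assumes journalctl and the default service name):

# follow the k3s server logs (includes the embedded etcd and the node controllers)
journalctl -u k3s -f

# or, after a failed rotation, grep the relevant window for etcd member removal activity
journalctl -u k3s --since "1 hour ago" | grep -iE "etcd|remove"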
c
ok, the etcd leader node?
ok, I'll check that, thanks
c
no. k3s doesn’t care which node is the etcd leader. etcd takes care of voting on that by itself.
c
ok, can I restrict the log analysis to my 3/4 control-plane nodes?
or do I need the agent nodes too?
c
Just the etcd nodes
and like I said, including the one you deleted
c
yes, thanks
c
Catch it before you terminate the instance and see what it says after it’s deleted. You should see one of the other etcd nodes trying to remove it from the cluster after the Kubernetes node resource is deleted.
c
ok
I'll do that tomorrow, it is late here. Thanks a lot, I wanted to be sure that I was doing what is expected.
@creamy-pencil-82913 I did manage to reproduce my issue once, but in all the logs I don't really know what string to look for. Would you know which string I can search for to see when the node is removed from etcd?
I found that:
    logrus.Infof("Starting managed etcd member removal controller")
    nodes.OnChange(ctx, "managed-etcd-controller", e.sync)
    nodes.OnRemove(ctx, "managed-etcd-controller", e.onRemove)
}

var (
    removalAnnotation         = "etcd." + version.Program + ".cattle.io/remove"
    removedNodeNameAnnotation = "etcd." + version.Program + ".cattle.io/removed-node-name"
)
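Based on that snippet, and assuming version.Program resolves to k3s, those annotations become etcd.k3s.cattle.io/remove and etcd.k3s.cattle.io/removed-node-name. One hedged way to watch the removal flow is to check whether they show up on the node object and to grep the server logs around the deletion; the node name is a placeholder:

# check whether the removal annotations from the snippet appear on the node object
OLD_NODE="<old-node-name>"   # placeholder
kubectl get node "$OLD_NODE" -o yaml | grep "etcd.k3s.cattle.io"

# and search the server logs for the controller's messages around deletion time
journalctl -u k3s | grep -i "member removal"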
c
Thanks a lot. It will be much easier
p
Since I set up the "graceful node shutdown" feature in the kubelet, it seems that I can no longer reproduce my problem.
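For anyone hitting the same thing: on the kubelet side, graceful node shutdown is driven by the shutdownGracePeriod fields of KubeletConfiguration. A rough sketch follows; the durations and file path are examples only, and how the file is handed to the k3s-managed kubelet (commonly via a kubelet-arg pointing at a config file) depends on the setup:

# hypothetical: write a KubeletConfiguration drop-in enabling graceful node shutdown
# (durations and path are examples; wiring it into the k3s-managed kubelet is setup-dependent)
cat > /etc/rancher/k3s/kubelet-custom.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 120s
shutdownGracePeriodCriticalPods: 30s
EOF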