# harvester
w
Ok - more fun with replacing a node... This node had a failed SSD boot drive and has been re-installed; the same machine is still listed as a node (cordoned). The API shows this on the node -
"conditions": [ 6 items
{
"error": false,
"lastHeartbeatTime": "2025-07-08T09:18:06Z",
"lastTransitionTime": "2025-07-07T17:33:06Z",
"lastUpdateTime": "2025-07-07T17:33:06Z",
"message": "Node is not a member of the etcd cluster",
"reason": "NotAMember",
"status": "False",
"transitioning": false,
"type": "EtcdIsVoter"
},
I did delete 2 pending volumes from Longhorn that were trying to attach (daemonsets for monitoring, I'm guessing), as I thought they may have prevented it from being removed (I did try a delete earlier but the node still stays). The reinstalled node has the same MAC/IP/name, but has been reinstalled from scratch. Any ideas how to get this node back into the party? I can change its assigned IP and attempt to add it as a new node; the question is how to remove the old node... it doesn't seem to want to clear at present. Based on the node API output it looks like the node is trying to talk but is being denied because etcd says no - guessing Harvester isn't going to re-init an existing node? Harvester 1.4.1 - I would raise this as a bug but I'm not sure if I've missed something obvious.
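For reference, one way to see what etcd itself thinks of the membership is to run etcdctl inside one of the surviving etcd pods - a rough sketch assuming RKE2's default static pod naming (etcd-<nodename>) and default cert paths, so adjust to your setup:
kubectl -n kube-system exec -it etcd-n2 -- etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list -w table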
Trying to drain this now to see if that'll sort things so it can be deleted -
kubectl drain n1 --delete-emptydir-data --ignore-daemonsets --force
Ultimately no workloads are on there - everything was migrated when the machine failed - and Longhorn is looking happy (apart from 1 node down, all volumes look ok)
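A quick way to see what the drain is still fighting with is to list whatever remains scheduled on the node - nothing Harvester-specific, just plain kubectl:
kubectl get pods -A --field-selector spec.nodeName=n1 -o wide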
Looks like this is cleaning things up... I guess the issue is that k8s under the hood still thinks that node may come back... it won't, as it was reinstalled, so it has to go first before it can rejoin afresh...
This is taking some time and I'm not sure if it's just stuck in a loop... Also spotted that n1 was a control-plane node; we have 2 left - but I'd expect one of the other workers to get promoted in this situation?
Shutting down the repaired node now, as I think it might be confusing things if it's trying to join while we're trying to remove it.
Disabling monitoring for now too; looks like the node drain isn't really working as all the pods are stuck terminating.
Guessing this is down to these -
"finalizers": [ 5 items
"<http://wrangler.cattle.io/node-remove-controller|wrangler.cattle.io/node-remove-controller>",
"<http://wrangler.cattle.io/harvester-network-manager-node-controller|wrangler.cattle.io/harvester-network-manager-node-controller>",
"<http://wrangler.cattle.io/managed-etcd-controller|wrangler.cattle.io/managed-etcd-controller>",
"<http://wrangler.cattle.io/node|wrangler.cattle.io/node>",
"<http://wrangler.cattle.io/maintain-node-controller|wrangler.cattle.io/maintain-node-controller>"
]
Since the node is stuck, I'm going to clear these with
kubectl patch node n1 -p '{"metadata":{"finalizers":[]}}' --type=merge
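To re-check what's left on the node object without the API browser, plain kubectl works too, something like:
kubectl get node n1 -o jsonpath='{.metadata.finalizers}'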
Checking back in the API, it now has 3 finalizers - but the node hasn't deleted and still won't delete. I'll get this onto an issue this afternoon as it's proving a bit of a problem. Ultimately one would expect that if hardware fails and is repaired it should be possible to handle this; we can't pre-empt when hardware will fail.
OK - just checking GitHub a bit more. Currently I have a node not joining and one stuck deleting. The one not joining is powered off now, to simplify things. Reading this, we can see the following state information -
craig@Craigs-Mac-Studio ~ % kubectl get nodes -A
NAME   STATUS                        ROLES                       AGE    VERSION
n1     NotReady,SchedulingDisabled   control-plane,etcd,master   320d   v1.30.7+rke2r1
n2     Ready                         control-plane,etcd,master   320d   v1.30.7+rke2r1
n3     Ready                         <none>                      320d   v1.30.7+rke2r1
n4     Ready                         control-plane,etcd,master   330d   v1.30.7+rke2r1
n5     Ready                         <none>                      119d   v1.30.7+rke2r1
craig@Craigs-Mac-Studio ~ % kubectl get machines -A
NAMESPACE     NAME                  CLUSTER   NODENAME   PROVIDERID   PHASE          AGE    VERSION
fleet-local   custom-0a80bc3bd515   local                             Provisioning   39h
fleet-local   custom-26be63540ad5   local     n1         rke2://n1    Deleting       320d
fleet-local   custom-74cf64960fd8   local     n5         rke2://n5    Running        119d
fleet-local   custom-a7e894cdd1b5   local     n2         rke2://n2    Running        320d
fleet-local   custom-c734823aefbd   local     n4         rke2://n4    Running        330d
fleet-local   custom-c8a6bbbeebc2   local     n3         rke2://n3    Running        320d
So it looks like n1 is stuck deleting currently. Going to leave this until after lunch, just in case it's because when I first came in I tried to delete again - but nothing seemed to happen... Since the UI just says cordoned, it's not clear without running the above that it is trying to delete. Assuming something isn't being satisfied somewhere.
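One thing worth inspecting while it sits in Deleting is the machine object itself - its finalizers and conditions usually hint at what the provisioning controller is still waiting on. A rough sketch using the name from the listing above:
kubectl -n fleet-local get machine custom-26be63540ad5 -o yaml
# look at metadata.deletionTimestamp, metadata.finalizers and status.conditions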
m
I ran into this issue when I recently built a cluster and noticed I had misconfigured some storage (which resulted in the wrong storage amount being displayed after I reconfigured it under the config option, despite it having the same amount of storage as all my other nodes). I tried removing the finalizer as well, but it was still stuck on deleting. I ultimately just rebuilt my cluster since I hadn't deployed any VMs yet, but yes - we should be able to recover a node that was replaced, for whatever reason, once the original is no longer available. Hopefully you can get some answers in here.
🙏 1
b
Might not help with the broken node but can you rejoin the node with a new name to get it back?
w
I would need to reinstall it and assign it a new IP, I think - it's not critical since we've got ample capacity - the critical thing is that the control plane only has 2 etcd nodes at present, as it hasn't promoted one of the other nodes, I believe.
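To confirm where etcd is actually running (i.e. whether any promotion has happened), the role labels and static pods tell the story - assuming RKE2's usual etcd role label:
kubectl get nodes -l node-role.kubernetes.io/etcd=true
kubectl -n kube-system get pods -o wide | grep etcd-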
Just raising this on GitHub now as I need to sort it - don't fancy building out another cluster and restoring it if I can help it!
p
w
As tempting as that sounds, I think I want to find out why Harvester hasn't triggered the onboarding - hacking it may not be ideal - I would rather spot what prevents a worker being promoted. Not precious about this node returning - it's just a workhorse at the end of the day.
I've fixed it - the notes are on my GitHub issue - I think this is still a bug, however. The trick was to force delete the pods that the drain was trying to clear, since these simply don't exist anymore... that triggered things to move forward.
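For completeness, force-deleting a stuck pod is just the stock kubectl command - roughly this, repeated per pod stuck Terminating on n1 (pod/namespace names are placeholders):
kubectl get pods -A --field-selector spec.nodeName=n1 | grep Terminating
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force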
🎉 1
m
Bug, or an inherited trait from Kubernetes? 😄
w
Well - it's managed, and Harvester's role is to handle this - there may be k8s behind it, and yes, you can tweak things there, but ideally that should be the last resort, surely...
m
oh, I 100% agree with you there
w
hence why I kinda went deep but treated this with a light touch, noting steps on the issue; this is a working cluster so I needed to make sure it transitioned smoothly - and it's all back up and running now 🙂