# harvester
w
Ok - more fun with replacing a node... This node had a failed SSD boot drive and has been re-installed; the same machine is still listed as a node (cordoned). The API shows this on the node -
"conditions": [ 6 items
{
"error": false,
"lastHeartbeatTime": "2025-07-08T09:18:06Z",
"lastTransitionTime": "2025-07-07T17:33:06Z",
"lastUpdateTime": "2025-07-07T17:33:06Z",
"message": "Node is not a member of the etcd cluster",
"reason": "NotAMember",
"status": "False",
"transitioning": false,
"type": "EtcdIsVoter"
},
I did delete 2 pending volumes from Longhorn that were trying to attach (daemonsets for monitoring, I'm guessing), as I thought they may have prevented it from being removed (I did try a delete earlier but the node still stays). The reinstalled node has the same MAC/IP/name, but has been reinstalled from scratch. Any ideas how to get this node back into the party? I can change its assigned IP and attempt to add it as a new node; the question is how to remove the old node... it doesn't seem to want to clear at present. Based on the node API output it looks like the node is trying to talk but is being denied because etcd says no - guessing Harvester isn't going to re-init an existing node? Harvester 1.4.1 - I would raise this as a bug but I'm not sure if I've missed something obvious.
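For reference, one way to see what etcd itself thinks of the membership is to run etcdctl inside one of the surviving etcd pods - a rough sketch assuming RKE2's default static pod naming (etcd-<nodename>) and default cert paths, so adjust to your setup:
kubectl -n kube-system exec -it etcd-n2 -- etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list -w table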
Trying to drain this now to see if that'll sort things so it can be deleted -
kubectl drain n1 --delete-emptydir-data --ignore-daemonsets --force
Ultimately no workloads are on there - everything was migrated when the machine failed - and Longhorn is looking happy (apart from 1 node down, all volumes look ok)
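A quick way to see what the drain is still fighting with is to list whatever remains scheduled on the node - nothing Harvester-specific, just plain kubectl:
kubectl get pods -A --field-selector spec.nodeName=n1 -o wide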
Looks like this is cleaning things up... I guess the issue is that k8s under the hood still thinks that node may come back... it won't, as it was reinstalled, so it has to go first before it can rejoin afresh...
This is taking some time and I'm not sure if it's just stuck in a loop... Also spotted that n1 was a control-plane node; we have 2 left - but I'd expect one of the other workers to get promoted in this situation?
Shutting down the repaired node now, as I think it might be confusing things if it's trying to join while we're trying to remove it.
Disabling monitoring for now too; looks like the node drain isn't really working as all the pods are stuck terminating.
Guessing this is down to these -
"finalizers": [ 5 items
"<http://wrangler.cattle.io/node-remove-controller|wrangler.cattle.io/node-remove-controller>",
"<http://wrangler.cattle.io/harvester-network-manager-node-controller|wrangler.cattle.io/harvester-network-manager-node-controller>",
"<http://wrangler.cattle.io/managed-etcd-controller|wrangler.cattle.io/managed-etcd-controller>",
"<http://wrangler.cattle.io/node|wrangler.cattle.io/node>",
"<http://wrangler.cattle.io/maintain-node-controller|wrangler.cattle.io/maintain-node-controller>"
]
Since the node is stuck, I'm going to clear these with
kubectl patch node n1 -p '{"metadata":{"finalizers":[]}}' --type=merge
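To re-check what's left on the node object without the API browser, plain kubectl works too, something like:
kubectl get node n1 -o jsonpath='{.metadata.finalizers}'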
Checking back in the API, it now has 3 finalizers - but the node hasn't deleted and still won't delete. I'll get this onto an issue this afternoon as it's proving a bit of a problem. Ultimately one would expect that if hardware fails and is repaired it should be possible to handle this; we can't pre-empt when hardware will fail.
OK - just checking GitHub a bit more. Currently I have a node not joining and one stuck deleting. The one not joining is powered off now, to simplify things. Reading this, we can see the following state information -
craig@Craigs-Mac-Studio ~ % kubectl get nodes -A
NAME   STATUS                        ROLES                       AGE    VERSION
n1     NotReady,SchedulingDisabled   control-plane,etcd,master   320d   v1.30.7+rke2r1
n2     Ready                         control-plane,etcd,master   320d   v1.30.7+rke2r1
n3     Ready                         <none>                      320d   v1.30.7+rke2r1
n4     Ready                         control-plane,etcd,master   330d   v1.30.7+rke2r1
n5     Ready                         <none>                      119d   v1.30.7+rke2r1
craig@Craigs-Mac-Studio ~ % kubectl get machines -A
NAMESPACE     NAME                  CLUSTER   NODENAME   PROVIDERID   PHASE          AGE    VERSION
fleet-local   custom-0a80bc3bd515   local                             Provisioning   39h
fleet-local   custom-26be63540ad5   local     n1         rke2://n1    Deleting       320d
fleet-local   custom-74cf64960fd8   local     n5         rke2://n5    Running        119d
fleet-local   custom-a7e894cdd1b5   local     n2         rke2://n2    Running        320d
fleet-local   custom-c734823aefbd   local     n4         rke2://n4    Running        330d
fleet-local   custom-c8a6bbbeebc2   local     n3         rke2://n3    Running        320d
So it looks like n1 is stuck deleting currently. Going to leave this until after lunch, just in case it's because when I first came in I tried to delete again - but nothing seemed to happen... Since the UI just says cordoned, it's not clear without running the above that it is trying to delete. Assuming something isn't being satisfied somewhere.
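One thing worth inspecting while it sits in Deleting is the machine object itself - its finalizers and conditions usually hint at what the provisioning controller is still waiting on. A rough sketch using the name from the listing above:
kubectl -n fleet-local get machine custom-26be63540ad5 -o yaml
# look at metadata.deletionTimestamp, metadata.finalizers and status.conditions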
m
I ran into this issue when I recently built a cluster and noticed I had misconfigured some storage (which resulted in the wrong storage amount being displayed after I reconfigured it under the config option, despite it having the same amount of storage as all my other nodes). I tried removing the finalizer as well, but it was still stuck on deleting. I ultimately just rebuilt my cluster since I hadn't deployed any VMs yet, but yes - we should be able to recover a node that was replaced, for whatever reason, once the original is no longer available. Hopefully you can get some answers in here.
🙏 1
b
Might not help with the broken node but can you rejoin the node with a new name to get it back?
w
I would need to reinstall it and assign it a new IP, I think - it's not critical since we've got ample capacity - the critical thing is that the control plane only has 2 etcd nodes at present, as it hasn't promoted one of the other nodes, I believe.
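To confirm where etcd is actually running (i.e. whether any promotion has happened), the role labels and static pods tell the story - assuming RKE2's usual etcd role label:
kubectl get nodes -l node-role.kubernetes.io/etcd=true
kubectl -n kube-system get pods -o wide | grep etcd-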
Just raising this on GitHub now as I need to sort it - don't fancy building out another cluster and restoring it if I can help it!
p
w
As tempting as that sounds, I think I want to find out why Harvester hasn't triggered the onboarding - hacking it may not be ideal - I would rather spot what prevents a worker being promoted. Not precious about this node returning - it's just a workhorse at the end of the day.
I've fixed it - the notes are on my GitHub issue - I think this is still a bug, however. The trick was to force delete the pods that the drain was trying to clear, since these simply don't exist anymore... that triggered things to move forward.
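For completeness, force-deleting a stuck pod is just the stock kubectl command - roughly this, repeated per pod stuck Terminating on n1 (pod/namespace names are placeholders):
kubectl get pods -A --field-selector spec.nodeName=n1 | grep Terminating
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force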
🎉 1
m
Bug, or an inherited trait from Kubernetes? 😄
w
Well - it's managed, and Harvester's role is to handle this - there may be k8s behind it, and yes, you can tweak things there, but ideally that should be the last resort, surely...
m
oh, I 100% agree with you there
w
hence why I kinda went deep but treated this with a light touch, noting steps on the issue; this is a working cluster so I needed to make sure it transitioned smoothly - and it's all back up and running now 🙂