# harvester
Has anyone had an issue in 1.4.0 with removing a failed node? We've had a bunch of power cuts and the boot drive has failed on 1 of 5 nodes. We've managed to clear all storage related to it and the VMs have migrated. Monitoring looks to have crashed with a failed volume on this machine, so we disabled monitoring, and the failed pods that were stuck "Terminating" have been removed via the embedded Rancher UI. The machine itself has been physically removed, as we're replacing the faulty boot drive. Yet the node still shows in the dashboard, despite being cordoned, requested for deletion and put into maintenance mode, with those steps taken many hours apart while we waited for volume rebuilds to finish. We're just running the installer from USB to get the new boot drive installed, but I'm a bit concerned that this still needs cleaning up...
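For reference, the stuck pods were cleared through the embedded Rancher UI; a rough kubectl equivalent (just a sketch, with placeholder pod/namespace names) would be:

```
# List pods still scheduled on the failed node (n5)
kubectl get pods -A --field-selector spec.nodeName=n5

# Force-remove a pod stuck in Terminating (placeholder names)
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
```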
From Longhorn, when attempting to remove the failed node:
"failed to delete node: could not delete node n5 with node ready condition is False, reason is KubernetesNodeNotReady, node schedulable false, and 0 replica, 0 engine running on it"
Looking at https://docs.harvesterhci.io/v1.4/host/, I'm running:
```
kubectl drain n5 --force --ignore-daemonsets --delete-local-data --pod-selector='app!=csi-attacher,app!=csi-provisioner'
```
but it looks like this is getting throttled and not getting very far... the tail of the current output follows:
```
I0311 15:52:39.693517   71719 request.go:697] Waited for 4.373975041s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/harvester-system/pods/virt-controller-86b84c8f8f-svhhz
I0311 15:52:49.892506   71719 request.go:697] Waited for 4.572894375s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/harvester-system/pods/virt-controller-86b84c8f8f-svhhz
I0311 15:52:59.893157   71719 request.go:697] Waited for 4.583385375s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/harvester-system/pods/harvester-network-controller-manager-7456fb5c47-tk66w
I0311 15:53:09.896043   71719 request.go:697] Waited for 4.7700325s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/cattle-system/pods/rancher-7497cc497c-2glff
I0311 15:53:20.093985   71719 request.go:697] Waited for 4.968612416s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/longhorn-system/pods/csi-resizer-b4569789c-9rtz7
I0311 15:53:30.292884   71719 request.go:697] Waited for 4.173486125s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/harvester-system/pods/harvester-node-manager-webhook-b4fcdcbc4-lgmsg
I0311 15:53:40.293912   71719 request.go:697] Waited for 4.367710875s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/longhorn-system/pods/longhorn-ui-d5c959b97-f6v5v
I0311 15:53:50.494342   71719 request.go:697] Waited for 4.580874s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/cattle-fleet-system/pods/gitjob-8bbb88865-9x7bj
I0311 15:54:00.495598   71719 request.go:697] Waited for 4.770384s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/cattle-system/pods/cattle-cluster-agent-69669c55b5-fljjm
I0311 15:54:10.693941   71719 request.go:697] Waited for 4.961615417s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/cattle-system/pods/cattle-cluster-agent-69669c55b5-fljjm
I0311 15:54:20.694135   71719 request.go:697] Waited for 4.171670791s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/cattle-monitoring-system/pods/prometheus-rancher-monitoring-prometheus-0
I0311 15:54:30.894116   71719 request.go:697] Waited for 4.370443291s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/cattle-monitoring-system/pods/prometheus-rancher-monitoring-prometheus-0
I0311 15:54:41.094281   71719 request.go:697] Waited for 4.579774041s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/cattle-system/pods/harvester-cluster-repo-7c85db6555-s7898
I0311 15:54:51.294343   71719 request.go:697] Waited for 4.94382775s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/cattle-monitoring-system/pods/rancher-monitoring-prometheus-adapter-55dc9ccd5d-l87lm
I0311 15:55:01.494167   71719 request.go:697] Waited for 5.160122333s due to client-side throttling, not priority and fairness, request: GET:https://rancher.web-engineer/k8s/clusters/c-m-j5d4hx4s/api/v1/namespaces/longhorn-system/pods/backing-image-manager-5ae6-e1b0
```
I wonder if this is going to complete or not; I suspect something is preventing these pods from draining. I've managed to boot the node up again and it's available via shell, but it's not connecting to Harvester, currently stuck on "Setting up harvester". I've got a new drive to replace the boot drive now; the current drive is temperamental and keeps refusing to be recognised at boot, so I'm aiming to get this node removed, swap out the drive and re-introduce it to the cluster. My thought is that having it available might somehow help with removing it from the cluster.
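For anyone hitting the same hang, a couple of things worth checking when a drain stalls like this (a sketch of the usual suspects, not steps taken here) are which pods are still left on the node and whether a PodDisruptionBudget is blocking eviction:

```
# Pods still scheduled on n5 after the drain attempt
kubectl get pods -A --field-selector spec.nodeName=n5 -o wide

# PodDisruptionBudgets that could be refusing evictions
kubectl get pdb -A

# Recent eviction-related events
kubectl get events -A --sort-by=.lastTimestamp | grep -i evict
```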
OK, I manually deleted the node using kubectl and one of the remaining nodes was promoted to control-plane... seeing if I can get it to re-join now!
```
n1:~ # kubectl get nodes
NAME   STATUS                        ROLES                       AGE    VERSION
n1     Ready                         control-plane,etcd,master   201d   v1.29.9+rke2r1
n2     Ready                         <none>                      201d   v1.29.9+rke2r1
n3     Ready                         <none>                      201d   v1.29.9+rke2r1
n4     Ready                         control-plane,etcd,master   211d   v1.29.9+rke2r1
n5     NotReady,SchedulingDisabled   control-plane,etcd,master   211d   v1.29.9+rke2r1
n1:~ # kubectl delete node n5
node "n5" deleted
n1:~ # kubectl get nodes
NAME   STATUS   ROLES                       AGE    VERSION
n1     Ready    control-plane,etcd,master   201d   v1.29.9+rke2r1
n2     Ready    <none>                      201d   v1.29.9+rke2r1
n3     Ready    <none>                      201d   v1.29.9+rke2r1
n4     Ready    control-plane,etcd,master   211d   v1.29.9+rke2r1
n1:~ # kubectl get nodes
NAME   STATUS   ROLES                       AGE    VERSION
n1     Ready    control-plane,etcd,master   201d   v1.29.9+rke2r1
n2     Ready    control-plane,etcd,master   201d   v1.29.9+rke2r1
n3     Ready    <none>                      201d   v1.29.9+rke2r1
n4     Ready    control-plane,etcd,master   211d   v1.29.9+rke2r1
```
That's worked!
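One follow-up check worth doing after the kubectl delete (a sketch, not output from this run) is confirming that Longhorn has also dropped its record of n5, so the earlier "failed to delete node" error doesn't linger:

```
# Longhorn should now only list the remaining nodes
kubectl -n longhorn-system get nodes.longhorn.io
```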