# harvester
a support bundle would be a nice place to get started
I put the node into maintenance mode and restarted it, and it came back up looking healthier - however it is now cordoned. Restarting the VMs (only 2 on there) migrated them to other hosts, and the stack is running with this one node cordoned. We were experimenting with using 1 replica on cluster workers: since the workers are throw-away, the thought was that if something went wrong in a worker the pods would migrate to a healthy worker and a new one could be spun up, and at the same time the I/O benefit of not having a replica sounded good on paper - shoot me down if this is not the done thing and could have been a contributing factor.
Also - we run MinIO in k8s, and for its storage we also reduced replicas to 1; since MinIO handles replication itself, the assumption was that MinIO would take care of it. That wasn't working well, with dreadful I/O speeds - those containers were in part hosted on these workers. All of this seems happy now that the new node is out of the equation.
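For context on the single-replica experiment: in Harvester these volumes are backed by Longhorn, and the replica count normally comes from the StorageClass. A minimal sketch of a single-replica class, assuming the stock Longhorn provisioner (the class name here is made up):

```
# Hypothetical StorageClass with a single Longhorn replica per volume.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: single-replica          # example name, not a Harvester default
provisioner: driver.longhorn.io # Longhorn CSI provisioner used by Harvester
parameters:
  numberOfReplicas: "1"         # one copy of each volume, on one node
  staleReplicaTimeout: "30"
EOF
```

With one replica a volume lives entirely on a single node, so cordoning or losing that node takes the volume offline with it - that's the trade-off being weighed above.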
However I don’t seem to be able to uncordon the node. It will come up and then quickly become cordoned again.
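For reference, the kubectl side of poking at this looks roughly like the following (a sketch - the node name n5 is a stand-in for whatever `kubectl get nodes` reports in your cluster):

```
# Cordoned nodes show SchedulingDisabled in the STATUS column
kubectl get nodes

# Clear the cordon manually; if it comes straight back, something
# (e.g. maintenance mode or a Harvester controller) is still acting on the node
kubectl uncordon n5

# Check for taints and maintenance-related annotations left on the node
kubectl describe node n5 | grep -i -A3 taints
kubectl get node n5 -o jsonpath='{.metadata.annotations}'
```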
Will generate a support bundle now. This node has a different motherboard but the same CPU, different NIC names, and more storage but the same types.
Where would be the right place to look for logs in this situation of not being able to uncordon the host? Are there pods in the Harvester cluster I can look at to see what's happening? Can't see anything in the UI to help explain what's happening.
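If it helps, the usual places to check from kubectl look something like this (a sketch; the namespaces and the harvester deployment name are the Harvester defaults, so verify them in your cluster):

```
# Node events often say why a node was cordoned or kept unschedulable
kubectl describe node n5 | grep -i -A10 events

# Harvester's own controllers run in the harvester-system namespace
kubectl -n harvester-system get pods
kubectl -n harvester-system logs deploy/harvester | grep -i -E 'cordon|maintenance'

# Longhorn (the storage layer) runs in longhorn-system
kubectl -n longhorn-system get pods
```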
Might have worked out the problem... might... When looking at the logs of the pod kube-proxy-n5 I could see it wanted to be IP .200, which was different from what was reported in the UI, so I updated my network so it was assigned .200 (rather than .61). Looks like it's sorting itself out now... it's picked up the new IP and gone into maintenance on startup, and stats have just started appearing, so hoping it will wake up soon...
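The Lens check above translates to roughly this with kubectl (assuming the node is named n5, to match the kube-proxy-n5 pod; the .200/.61 addresses are the ones from this thread):

```
# kube-proxy runs per node in kube-system; find the one for this node
kubectl -n kube-system get pods -o wide | grep kube-proxy

# Its logs showed which node IP it expected (.200 here)
kubectl -n kube-system logs kube-proxy-n5 | grep -i -E 'node|address'

# Compare with the addresses the node object actually reports
kubectl get node n5 -o jsonpath='{.status.addresses}'
```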
Yep - that's fixed it. I manually disabled maintenance mode, assuming Harvester enabled that as a precaution with things changing! The host looks good now, so this is resolved. For anyone else that gets this - what helped me was looking at the Harvester cluster in Lens at the logs of the related kube-proxy pod, where the reason was more obvious!