# longhorn-storage
c
If I reboot a worker node in my cluster, I get tons of problems with Longhorn block devices and file systems becoming unresponsive. It takes at least 10 minutes until some watchdog pulls the plug and the host finally reboots. How come? Is networking shut down before Longhorn?
b
Are you cordoning and evacuating the node prior to reboot?
c
I'll try to remember this step ...
b
[image attachment: image.png]
a
always drain a node before rebooting it
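e.g., a minimal sketch of that step (`worker-1` is a placeholder node name; adjust the timeout for your cluster):

```bash
# Cordon first so no new pods are scheduled onto the node:
kubectl cordon worker-1

# Drain it: --ignore-daemonsets is required because DaemonSet pods
# (including the Longhorn manager) cannot be evicted, and
# --delete-emptydir-data permits evicting pods that use emptyDir volumes.
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data --timeout=300s

# ... reboot the node ...

# Let it take workloads again once it is back up:
kubectl uncordon worker-1
```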
c
What about shutting down the whole cluster for some maintenance of the external power supply? The default drain policy (block if the node holds the last replica) won't work, AFAIU. Which drain policy would you suggest?
a
tbh i have never intentionally shut down my entire cluster; the few times it has happened accidentally, it has taken some time + some prodding for everything to start back up. there is an option which allows the drain if the last replica is stopped, which i think is probably the best option for this situation, though i haven't used it. the point about running drain is that it gives k8s some advance warning to move the workloads off that node before it gets shut down. if it shuts down without doing that, then the cluster + the kubelet on that node still think it has those workloads, and it takes time for the various controllers to work out what has happened
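the option referred to here is Longhorn's `node-drain-policy` setting; an untested sketch of changing it, assuming Longhorn is installed in the standard `longhorn-system` namespace:

```bash
# Show the current policy (the default is block-if-contains-last-replica):
kubectl -n longhorn-system get settings.longhorn.io node-drain-policy

# Allow the drain even when the node holds the last replica,
# provided that replica is stopped:
kubectl -n longhorn-system patch settings.longhorn.io node-drain-policy \
  --type=merge -p '{"value": "allow-if-replica-is-stopped"}'
```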
b
etcd leader is going to be a problem.
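One way to check which member is the leader before picking a shutdown order (a sketch for a kubeadm-style cluster; the pod name `etcd-cp-1` and the certificate paths are assumptions):

```bash
# etcd-cp-1 is the static etcd pod on a control-plane node named cp-1 (assumption).
# The IS LEADER column in the output marks the current leader.
kubectl -n kube-system exec etcd-cp-1 -- etcdctl \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  endpoint status --cluster -w table
```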
c
Usually I shut down the workers first, then one control-plane/etcd node after the other. Start back up in the opposite sequence, with a 3-minute delay between the management nodes.
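A rough sketch of that sequence (all node names are hypothetical, and `ssh ... poweroff` stands in for whatever shutdown mechanism your environment uses):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inventory; replace with your own node names.
WORKERS="worker-1 worker-2 worker-3"
CONTROL_PLANES="cp-3 cp-2 cp-1"

# 1. Drain and power off the workers first.
for node in $WORKERS; do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  ssh "$node" sudo poweroff
done

# 2. Then stop the control-plane/etcd nodes one after the other.
for node in $CONTROL_PLANES; do
  ssh "$node" sudo poweroff
  sleep 60
done

# Startup is the reverse: power on the control-plane nodes one by one
# with ~3 minutes between them, then the workers, then
# `kubectl uncordon` each worker.
```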