# harvester
salmon-city-57654:
Agree, most scenarios would not be happy with a 15-minute failover. I'd just like to clarify a few things:
• What did you simulate for the HA scenarios? (Shut down the node, break the network connection, or something else?)
• Is the volume hotplugged?
• How did you verify that Longhorn knows the node is dead? (Could you check the im-e/im-r pods?)
I would also like to run some checks on this. Thanks for raising it!
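(For anyone reproducing this, a minimal sketch of the two failure modes mentioned above; the node name `node2` and interface `eth0` are placeholders:)
```bash
# Simulate abrupt node loss (closest to a power failure)
ssh node2 'sudo poweroff -f'

# Or simulate a network partition instead: drop the node's cluster link.
# Run this from the node's local console, since it also kills the SSH path.
sudo ip link set eth0 down
```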
full-train-34126:
So, 3-node cluster. The test was a shutdown of the node. Tweaked the rke2 settings for k8s:
• kubelet: node-status-update-frequency=4s (from 10s)
• controller-manager: node-monitor-period=4s (from 5s), node-monitor-grace-period=16s (from 40s), pod-eviction-timeout=30s (from 5m)
• kube-apiserver: default-not-ready-toleration-seconds=30, default-unreachable-toleration-seconds=30
K8s then labels the node as unreachable after about 1 minute; at the same time the node goes unreachable in Rancher, it also goes unreachable in Harvester, and 'Down' in the Longhorn UI on the Node page. So that's pretty quick. At this point the node is down in Longhorn, but the volume remains attached to the dead node for around 5-7 minutes (feels like another default timer somewhere). Whilst this is happening K8s is erroring "Failed to attach PVC... to pod...". Then it detaches, reattaches to a remaining node, and repairing the degradation/binding to the new pod/VMI takes a few more minutes.
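(For readers following along, those tweaks would look roughly like this on an RKE2 server; a sketch only, since Harvester may manage the RKE2 config differently, and the file path here is an assumption:)
```bash
# Append the tuned args to the RKE2 server config, then restart to apply.
# Note: kubelet-arg must be set on every node (agents included), not just servers.
cat <<'EOF' | sudo tee -a /etc/rancher/rke2/config.yaml
kubelet-arg:
  - "node-status-update-frequency=4s"            # default 10s
kube-controller-manager-arg:
  - "node-monitor-period=4s"                     # default 5s
  - "node-monitor-grace-period=16s"              # default 40s
  - "pod-eviction-timeout=30s"                   # default 5m; largely superseded
                                                 # by taint-based eviction
kube-apiserver-arg:
  - "default-not-ready-toleration-seconds=30"    # default 300s
  - "default-unreachable-toleration-seconds=30"  # default 300s
EOF
sudo systemctl restart rke2-server
```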
I'll look up what the im-e/im-r pods are doing and let you know.
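(A quick way to check those, in case it helps: im-e/im-r are Longhorn's instance-manager pods for the engine and replica. The exact labels vary by Longhorn version, so a plain grep is the safe approach:)
```bash
# List Longhorn instance-manager pods and the nodes they run on
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# Watch which node each Longhorn volume is attached to during failover
kubectl -n longhorn-system get volumes.longhorn.io -o wide -w
```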
salmon-city-57654:
Hi @full-train-34126, https://github.com/harvester/harvester/issues/4049 should help your case. The fix is included in v1.2.0, so maybe you could try again after the v1.2.0 release.
full-train-34126:
Thanks @salmon-city-57654. Seems like it's all in hand; that's exactly the issue I was experiencing. I'll upgrade to the 1.2.0 release when it's available.