# longhorn-storage
Well, it really depends on your machines. cc @famous-shampoo-18483 @famous-journalist-11332
Agree with Derek! When you increase the `Concurrent Replica Rebuild Per Node Limit`, it will use more resources (CPU, RAM, network bandwidth) in your cluster and consequently slow down other operations like read/write or your application workloads. You really have to try it out in your particular cluster and adjust it accordingly. Additionally, if you replace a node, replica eviction is a must; but if you just need to reboot a node, maybe you just need to drain it. See more instructions at https://longhorn.io/docs/1.6.0/maintenance/maintenance/
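If you'd rather script the setting change and the cordon step than click through the UI, here is a rough, untested sketch with the Python kubernetes client. It assumes the setting CR is named `concurrent-replica-rebuild-per-node-limit` in the `longhorn-system` namespace, that your Longhorn version serves the `v1beta2` API, and `worker-1` is just a placeholder node name, so double-check those against your cluster first.

```python
# Rough, untested sketch: raise the rebuild concurrency before maintenance,
# then cordon the node you are about to reboot.
# Assumptions: kubernetes Python client installed, a working kubeconfig,
# Longhorn serving the v1beta2 API, and the setting CR named
# "concurrent-replica-rebuild-per-node-limit" in longhorn-system.
from kubernetes import client, config

config.load_kube_config()

# Longhorn settings are namespaced CRs with a top-level string "value" field.
custom = client.CustomObjectsApi()
custom.patch_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta2",  # older releases may only serve v1beta1
    namespace="longhorn-system",
    plural="settings",
    name="concurrent-replica-rebuild-per-node-limit",
    body={"value": "5"},  # pick a limit your nodes can actually sustain
)

# Cordon the node (equivalent to `kubectl cordon`); a full drain additionally
# evicts the pods, which this sketch does not do.
core = client.CoreV1Api()
core.patch_node("worker-1", {"spec": {"unschedulable": True}})  # placeholder node name
```

Remember to uncordon the node afterwards and drop the limit back down once the rebuilds settle, since the higher limit is what eats into read/write performance.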
👍 3
a
Great, thank you both! We try to do this after hours, so we can probably bump that number up then. Yes, we're mostly doing this for reboots. We did try just draining the node, but that would cause random problems with replicas. It doesn't happen every time, so it's difficult to reproduce, but basically we would cordon and drain the node and then reboot. Usually the running replicas would go into a failed state and then get rebuilt on a different node. But sometimes after the Longhorn node came back up, volumes with replicas on that node could not attach. I'd go to attach a volume and the two replicas on other nodes would show as Running, while the replica that was on the rebooted node would show as Stopped. The volume would sit in Attaching basically forever; the only way I could clear it up would be to completely remove that Longhorn node from the cluster. This doesn't happen every time, so we'd do maintenance in the evening, think everything was back up, then come in the next morning to some angry customers who can't run their containers that require a persistent volume. We're currently running Longhorn 1.3.2 in prod, so it's possible this was just a bug in an older version; we'll be transitioning to a new cluster running Longhorn v1.5.3 soon, so maybe it won't happen there?
i
@abundant-hair-58573 If the issue happens again, you can generate a support bundle for us.
> We're currently running Longhorn 1.3.2 in prod, so it's possible this was just a bug in an older version; we'll be transitioning to a new cluster running Longhorn v1.5.3 soon, so maybe it won't happen there?

You can try v1.5.3 and see if the issue is gone.
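In the meantime, if you want a quicker look than a full support bundle next time a volume gets stuck, dumping the replica CRs shows which node each replica is on and what state Longhorn thinks it is in. This is only a rough sketch, assuming the Python kubernetes client and the `v1beta2` API; the `spec.volumeName` / `spec.nodeID` / `status.currentState` field names are what recent CRDs use, so verify them against your 1.3.2 and 1.5.3 clusters.

```python
# Rough, untested sketch: list Longhorn replica CRs and print volume, node, and
# state for each one. Assumptions: kubernetes Python client, v1beta2 Longhorn API,
# and the spec.volumeName / spec.nodeID / status.currentState fields.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

replicas = api.list_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta2",
    namespace="longhorn-system",
    plural="replicas",
)

for r in replicas["items"]:
    name = r["metadata"]["name"]
    volume = r.get("spec", {}).get("volumeName", "<unknown>")
    node = r.get("spec", {}).get("nodeID", "<none>")
    state = r.get("status", {}).get("currentState", "<unknown>")
    print(f"{volume}\t{name}\tnode={node}\tstate={state}")
```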
👍 2