👋 We have an app that orchestrates "static" pods on the cluster, just like the Longhorn instance manager pods. Those "static" pods are used as dev environments and mount Longhorn volumes. They are considered stateful.
Occasionally our system load becomes high and the Longhorn instance manager restarts. That causes I/O errors in the "static" pods on the same node, and we then have to restart those pods.
Is there something we can do to make this more resilient? We cannot guarantee that the Longhorn instance manager never restarts. A brief disruption is OK, as long as the pods recover from it quickly. Having to restart the pods is quite disruptive.