faint-leather-9642
06/14/2024, 8:18 AM
fstrim is a standard service that runs at midnight on Sundays, and it “feels” like it causes the failure, but this is just a theory and there’s no real evidence yet that its actions are what’s breaking things. Regardless, the way to fix the problem is to delete the affected Pods.
I think this is a known fault, since later longhorn versions seem to “mask” the problem by automatically deleting affected Pods.
So, every Monday we take a look at the cluster and, more often than not, have to delete various database, redis and other such Pods that are using affected longhorn volumes. Deleting a Pod is easy, but it’s annoying. Clearly it’s a known problem and the “workaround” is to delete affected Pods, but …
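For reference, the Monday cleanup amounts to something like the sketch below. The `.status.robustness` field and the `healthy` value are what we believe the longhorn volume CRD exposes; verify against your own `kubectl get volumes.longhorn.io` output before relying on this.

```shell
# Sketch only: list longhorn volumes that are not reporting "healthy".
# Field name (.status.robustness) and state values are assumptions taken
# from longhorn's volume CRD; check them against your cluster first.

# Filter "name robustness" pairs down to the unhealthy volume names.
unhealthy_volumes() {
  awk '$2 != "" && $2 != "healthy" {print $1}'
}

# On the cluster this would be fed from kubectl, e.g.:
#   kubectl -n longhorn-system get volumes.longhorn.io -o \
#     jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.robustness}{"\n"}{end}' \
#   | unhealthy_volumes
# ...and then the Pod using each listed volume gets deleted by hand.
```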
1. what’s actually wrong?
2. does a newer longhorn fix this (without deleting Pods)?
3. are there early “signs” that we can use to reliably detect the onset of the issue (logs etc.)?
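On question 3, the sort of early check we had in mind is grepping node kernel logs for block-layer I/O errors before the Pod actually wedges. The patterns below are guesses at typical failure signatures, not confirmed longhorn log lines:

```shell
# Guesswork, not confirmed signatures: scan kernel log lines for block-layer
# I/O errors, which often precede a volume going read-only or faulted.
longhorn_io_errors() {
  grep -E 'blk_update_request: I/O error|Buffer I/O error' || true
}

# On a node: dmesg -T | longhorn_io_errors
# (the "|| true" keeps the exit status clean when nothing matches)
```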
For various reasons we are unable to upgrade longhorn or k8s.
We have about 25 volumes, no obvious network bandwidth issues and plenty of cores. We’re on an OpenStack cluster with machines running fstrim weekly (on Sunday night as is normal).