faint-leather-9642
06/14/2024, 8:18 AM
fstrim is a standard service that runs at midnight on Sundays, and it “feels” like it causes the failure, but this is just a theory and there’s no real evidence yet that its actions are what’s breaking things. Regardless, the way to fix the problem is to delete the affected Pods.
I think this is a known fault, since later longhorn versions seem to “mask” the problem by automatically deleting affected Pods.
So, every Monday we take a look at the cluster and, more often than not, have to delete various database, redis and other such Pods that are using affected longhorn volumes. Deleting a Pod is easy, but it’s annoying. Clearly it’s a known problem and the “workaround” is to delete affected Pods, but …
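For reference, the Monday cleanup amounts to something like the sketch below. The `.status.robustness` field and the `healthy` value are what we believe the longhorn volume CRD exposes; verify against your own `kubectl get volumes.longhorn.io` output before relying on this.

```shell
# Sketch only: list longhorn volumes that are not reporting "healthy".
# Field name (.status.robustness) and state values are assumptions taken
# from longhorn's volume CRD; check them against your cluster first.

# Filter "name robustness" pairs down to the unhealthy volume names.
unhealthy_volumes() {
  awk '$2 != "" && $2 != "healthy" {print $1}'
}

# On the cluster this would be fed from kubectl, e.g.:
#   kubectl -n longhorn-system get volumes.longhorn.io -o \
#     jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.robustness}{"\n"}{end}' \
#   | unhealthy_volumes
# ...and then the Pod using each listed volume gets deleted by hand.
```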
1. what’s actually wrong?
2. does a newer longhorn fix this (without deleting Pods)?
3. are there early “signs” that we can use to reliably detect the onset of the issue (logs etc.)?
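On question 3, the sort of early check we had in mind is grepping node kernel logs for block-layer I/O errors before the Pod actually wedges. The patterns below are guesses at typical failure signatures, not confirmed longhorn log lines:

```shell
# Guesswork, not confirmed signatures: scan kernel log lines for block-layer
# I/O errors, which often precede a volume going read-only or faulted.
longhorn_io_errors() {
  grep -E 'blk_update_request: I/O error|Buffer I/O error' || true
}

# On a node: dmesg -T | longhorn_io_errors
# (the "|| true" keeps the exit status clean when nothing matches)
```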
For various reasons we are unable to upgrade longhorn or k8s.
We have about 25 volumes, no obvious network bandwidth issues and plenty of cores. We’re on an OpenStack cluster with machines running fstrim weekly (on Sunday night as is normal).