
sparse-businessperson-74827

07/06/2022, 12:17 PM
Can anyone help me out? Longhorn started behaving strangely. The instance-manager-e pod gets terminated on all nodes every 2-5 minutes. Seeing
Liveness probe failed: dial tcp 10.42.5.68:8500: i/o timeout
when this happens, but it only affects these pods; none of the other pods are impacted. The nodes are not overloaded either.
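(A quick way to check whether pods like this are being evicted or killed under node pressure rather than failing on their own; the pod and node names below are placeholders:

    # Inspect the failing pod's recent events (probe failures, OOMKilled, evictions)
    kubectl -n longhorn-system describe pod instance-manager-e-xxxxx

    # Review namespace events in time order for restarts and evictions
    kubectl -n longhorn-system get events --sort-by=.metadata.creationTimestamp

    # Check the node for memory/disk pressure conditions
    kubectl describe node <node-name> | grep -A 7 'Conditions:'
)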

hundreds-hairdresser-46043

07/07/2022, 11:43 AM
I have had similar problems. It also looks like this channel is not really monitored. We have gone back to rook-ceph for now - all issues went away.

aloof-hair-13897

07/09/2022, 2:31 PM
@sparse-businessperson-74827 @hundreds-hairdresser-46043 Sorry for the late reply. Could you provide the logs of the instance-manager-e pod, or a support bundle? Did you perform any operations when Longhorn started behaving strangely?
Which OS and Kubernetes distro + version are you running?

sparse-businessperson-74827

07/09/2022, 3:30 PM
we got it sorted

hundreds-hairdresser-46043

07/09/2022, 4:41 PM
@cuddly-vase-67379 For the moment we are back on rook-ceph. We have a sprint coming up where we are going to do a changeover to Longhorn and properly test it. So for the moment, no worries. But out of interest, we are running Oracle Linux 8.
@sparse-businessperson-74827 May I ask what it was? Trying to avoid the same pitfall (we saw it as well).

sparse-businessperson-74827

07/13/2022, 3:40 PM
@hundreds-hairdresser-46043 it was due to load on the nodes. The problem is that Longhorn does not set a priority class by default, so the pods got evicted. I would consider persistent storage a critical component and would have expected it to be prioritized over normal workloads.
Also, setting a priority class after you already have volumes is a pain, as you need to take all of them down first.
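(For anyone hitting the same thing, a minimal sketch of applying a priority class to Longhorn, assuming a Helm-based install and that the chart exposes a defaultSettings.priorityClass value; the class name longhorn-critical and its value are made up, and, as noted above, changing this on an existing install means detaching the volumes first:

    # Create a high-priority class for storage components (name and value are illustrative)
    kubectl create priorityclass longhorn-critical --value=1000000 \
      --description="Keep Longhorn storage pods from being evicted under node pressure"

    # Point Longhorn at it; defaultSettings.priorityClass is assumed to be the chart value
    helm upgrade longhorn longhorn/longhorn --namespace longhorn-system \
      --reuse-values --set defaultSettings.priorityClass=longhorn-critical
)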