# harvester
Hi @quaint-alarm-7893, I will take a look at this SB. Do you remember what operation you were running before the cluster got stuck?
thanks @salmon-city-57654, it usually seems like it's when i put a load on the cluster pushing/pulling data, but the most recent time it had an issue, it was just running day-to-day at-rest stuff. i'm not sure if i have a bad switch in the mix, or a node with some sort of issue, or bad disks, but it just randomly goes into a tailspin. most recently my issue is vms lock up w/ an io error, one of them being my gateway (it's a software gateway called untangle, similar to pfsense). when i lose it and try to bring it back online, the vm goes into a perpetual reboot and i have to restore it from a backup. along with it, several other vms lock up w/ io errors. but i can't seem to pinpoint exactly what causes the io error because of the obfuscation w/ replicas, salvage, and all of the things longhorn does. sorry, kinda rambling here... 😕
@salmon-city-57654 you find anything useful?
hi @quaint-alarm-7893, I checked the SB and see some IOError events, but I have not found anything conclusive yet. Could you tell me the specific volume that hit the IOError? I see two IOError events:
1. of natimarkwebdb
2. of addrbbox
Do these two VMs lock up as you mentioned above? Also, I see some errors in the kernel log on the node where it is located. Could you check that node's hardware status? Thanks!
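Since "check your hardware status" is open-ended, here is a minimal sketch of what that check could look like on each node. This is not from the thread: it assumes smartmontools is installed and that the data disks show up as `/dev/sd*` (adjust device names per node), and it must run as root for full results.

```shell
#!/bin/sh
# Hedged sketch (assumption, not the support engineer's procedure):
# quick per-node disk health check. Assumes smartmontools is installed
# and data disks appear as /dev/sd*; adjust for your hardware.
for dev in /dev/sd?; do
  [ -b "$dev" ] || continue                 # skip if no such block device
  echo "== $dev =="
  smartctl -H "$dev" || echo "smartctl failed for $dev (needs root / smartmontools)"
done
# Scan the kernel ring buffer for the kind of I/O errors mentioned above:
dmesg 2>/dev/null | grep -iE 'i/o error|blk_update_request|ata[0-9]+.*(error|failed)' \
  || echo "no matching kernel errors in dmesg"
```

`smartctl -H` prints an overall drive health verdict; the `dmesg` grep surfaces kernel-level I/O errors like the ones showing up in the bundles.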
@salmon-city-57654 the issue is it seems to move around. it's not always the same vm, and it's not always the same node. just had another issue now: everything was replicated and happy, not under a heavy load, but all of a sudden i'm replicating a bunch of degraded volumes. i've been following the dmesg logs on each node, and only one node has an error: harvester-05. i just dumped both bundles (lh and har). dmesg points to devices that are no longer attached to the node, so i'm unsure how to track down where the problem is.
now it's spamming dmesg with io errors on dm-1 and dm-2, which map to LVM volumes for a vm that's not even powered on, or mounted. it's like it's stuck trying to access a volume that's not even mounted any more.
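For tying dmesg's `dm-N` names back to actual volumes, a sketch like the following could help (my assumption of a workable approach, not something from the thread). Note that `dm-N` numbers are assigned at device activation and can change across reboots, so they should be resolved on the node that logged the error:

```shell
#!/bin/sh
# Hedged sketch: resolve which device-mapper target (LVM LV, Longhorn
# device, etc.) a dm-N node maps to, so "I/O error on dm-1" in dmesg
# can be traced to a specific volume.
for d in dm-1 dm-2; do                      # the devices from the dmesg spam
  if [ -r "/sys/block/$d/dm/name" ]; then
    printf '%s -> %s\n' "$d" "$(cat "/sys/block/$d/dm/name")"
  else
    echo "$d: not present on this node"
  fi
done
# Cross-check: lsblk shows the device tree; dmsetup (root) shows the tables.
lsblk -o NAME,KNAME,TYPE,SIZE,MOUNTPOINT 2>/dev/null | grep 'dm-' || true
dmsetup info -c 2>/dev/null || echo "run 'dmsetup info -c' as root for details"
```

If `dm-1`/`dm-2` resolve to volumes of a powered-off VM, that would support the "stuck access to an unmounted volume" theory: the device-mapper table can remain active even when nothing is mounted on top of it.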
@salmon-city-57654 bump RE Git issue 3843
Hi @quaint-alarm-7893, sorry I missed this thread. What is the latest situation now? Are you still hitting the random IO errors with your VMs? If so, could you generate a SB when you hit the IO error on a VM and attach it here? I want to check it again. Thanks.
it still does it periodically, yes. i can try to clear everything out tonight and see if i can get it to do it again, and i'll post back w/ an updated bundle
Thanks @quaint-alarm-7893. Once we have the latest bundle we can check it again.