This message was deleted Rancher Users #harvester


# harvester

adamant-kite-43734

03/08/2023, 2:54 AM

This message was deleted.

supportbundle_9042f568-6514-4a9f-a6c3-96a342641671_2023-03-08T02-52-11Z.zip

salmon-city-57654

03/09/2023, 2:21 PM

Hi @quaint-alarm-7893, I will take a look at this SB. Do you remember what operation you were doing before the cluster got stuck?

quaint-alarm-7893

03/09/2023, 4:13 PM

thanks @salmon-city-57654, it usually seems to happen when i put a load on the cluster pushing/pulling data, but the most recent time it had an issue, it was just running day-to-day at-rest stuff. i'm not sure if i have a bad switch in the mix, or a node with some sort of issue, or disks, but it just randomly goes into a tailspin. most recently my issue is vms lock up w/ an io error, one of them being my gateway (it's a software gateway called untangle, similar to pfsense). when i lose it and i try to bring it back online the vm goes into a perpetual reboot and i have to restore it from a backup. with it, several other vms lock up w/ io errors. but i can't seem to find exactly what causes the io error because of the obfuscation w/ replicas, salvage, and all of the things longhorn does. sorry. kinda rambling here... 😕

quaint-alarm-7893

03/11/2023, 5:15 AM

@salmon-city-57654 you find anything useful?

salmon-city-57654

03/13/2023, 5:52 PM

hi @quaint-alarm-7893, I checked the SB and see some IOError events, but I have not found any useful information yet. Could you tell me which specific volumes hit the IOError? I see two IOError events: (1) disk-4 of natimarkwebdb and (2) root of addrbbox. Did these two VMs lock up as you mentioned above? Also, I see some errors in the kernel log on the node harvester-1, where natimarkwebdb is located. Could you check your hardware status? Thanks!

quaint-alarm-7893

03/15/2023, 4:06 PM

@salmon-city-57654 the issue is it seems to move around. it's not always the same vm, it's not always the same node. just had another issue now, everything was replicated and happy, not a heavy load, but all of a sudden, i'm replicating a bunch of degraded volumes. i've been following the dmesg logs on each node, and only one node has an error: harvester-05. i just dumped both bundles (lh and har). dmesg points to devices that are no longer attached to the node, so i'm unsure how to track down where the problem is.

supportbundle_9042f568-6514-4a9f-a6c3-96a342641671_2023-03-15T15-57-14Z.zip harvester-05-dmesg.txt longhorn-support-bundle_b0b85b06-bd06-434b-95e8-c47b11be903f_2023-03-15T15-57-00Z.zip

quaint-alarm-7893

03/15/2023, 5:07 PM

now it's spamming dmesg with io errors on dm-1 and dm-2, which map to lvms for a vm that's not even on, or mounted. it's like it's stuck trying to access a volume that's not even mounted any more.

harvester-05-dmesg.txt
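
For anyone triaging a similar dmesg dump, a rough way to see which block devices are producing the errors is to count the I/O error lines per device. This is only a sketch: the regex assumes the common kernel message shapes ("I/O error, dev X" and "Buffer I/O error on dev X"), the function name `count_io_errors` is made up for illustration, and the sample lines are invented rather than taken from the attached harvester-05-dmesg.txt.

```python
import re
from collections import Counter

# Assumed message shapes (not copied from the attached log):
#   "blk_update_request: I/O error, dev dm-1, sector 2048 ..."
#   "Buffer I/O error on dev dm-2, logical block 0, async page read"
IO_ERROR_RE = re.compile(r"I/O error(?:,| on) dev(?:ice)? ([\w-]+)")


def count_io_errors(lines):
    """Return a Counter mapping device name -> number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "[1234.5] blk_update_request: I/O error, dev dm-1, sector 2048 op 0x0:(READ)",
        "[1234.6] Buffer I/O error on dev dm-1, logical block 0, async page read",
        "[1234.7] Buffer I/O error on dev dm-2, logical block 0, async page read",
        "[1234.8] EXT4-fs (sda1): mounted filesystem",
    ]
    for device, n in count_io_errors(sample).most_common():
        print(device, n)
```

To run it against a saved dump, pass the open file instead of the sample list, for example `count_io_errors(open("harvester-05-dmesg.txt"))`. On the node itself, `lsblk` or `dmsetup ls` can then show which volumes the dm-N names belong to.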

quaint-alarm-7893

05/02/2023, 4:28 PM

@salmon-city-57654 bump RE Git issue 3843 https://github.com/harvester/harvester/issues/3843

salmon-city-57654

05/03/2023, 11:28 AM

Hi @quaint-alarm-7893, sorry I missed this thread. What is the latest situation? Do you still hit random IO errors with your VMs? If yes, could you generate an SB when you hit the IO error on your VM and attach it here? I want to check it again. Thanks.

quaint-alarm-7893

05/03/2023, 3:05 PM

it still does it periodically yes, i can try and clear out everything tonight, and see if i can get it to do it again and i'll post back w/ an updated bundle

salmon-city-57654

05/03/2023, 4:26 PM

Thanks @quaint-alarm-7893. Once we have the latest bundle we can check it again.
