# longhorn-storage
l
We need to be careful here to avoid potential data loss, as you're now in a situation where you're down to the last replica - and we don't know if it's healthy.
s
cc @famous-shampoo-18483
f
• If all replicas of a volume become unavailable, please follow this doc to do data recovery first: http://localhost:1313/docs/1.3.0/advanced-resources/data-recovery/
• I guess something may be wrong with volume live upgrade. If the volume is not upgraded, all replicas should still be in the old instance managers, and Longhorn won't clean up old instance managers that still hold engines/replicas (see the sketch below for one way to check this). Can you provide the support bundle?
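A quick way to see which instance managers are still around and where each replica thinks it lives is to query the Longhorn CRDs directly. This is a minimal sketch, assuming the default `longhorn-system` namespace and Longhorn 1.3.x CRD names; adjust for your install.

```bash
# List instance managers (engine and replica managers) and the nodes they run on.
kubectl -n longhorn-system get instancemanagers.longhorn.io

# List the replica custom resources; the NAME column encodes the volume name.
kubectl -n longhorn-system get replicas.longhorn.io

# Inspect one replica in detail to see which node / instance manager it is bound to.
kubectl -n longhorn-system describe replicas.longhorn.io <replica-name>
```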
s
How can I get the support bundle? I ended up resolving this by restoring my most important volumes from backups, but I think I may still have a couple of volumes in this state.
f
On the bottom left of the Longhorn UI page, there is a `Generate Support Bundle` button.
s
Ok, I have the bundle. Do I just put it here?
f
You can send it to Longhorn Support Bundle <longhorn-support-bundle@Suse.com>. And it's better to point out the problematic volume/node names. Thank you!
s
Alright, submitted. Thanks!
f
For volume `pvc-3b98bec7-20a3-4f1d-9ba8-c66ce8237075` (the related PVC is `data-consul-consul-server-0`), the only replica is on the NotReady node `k3s-e`, hence Longhorn won't set `status.instanceManagerName`.

---

For volume `pvc-ca1cee19-a051-483e-9c38-626a444b4678` (the related PVC is `data-consul-consul-server-1`), things are weird… The 2 replicas of this volume are on 2 Ready nodes, `k3s-a` and `k3s-b`, but somehow the instance manager name is not set either. According to the only log in the longhorn manager pod, the reconciliation of the replica may keep failing or blocking, hence the replica controller in the longhorn manager pod cannot update the replica status:
```
2022-06-22T00:07:45.571613465-04:00 I0622 04:07:45.571313       1 trace.go:205] Trace[1073437889]: "DeltaFIFO Pop Process" ID:longhorn-system/pvc-ca1cee19-a051-483e-9c38-626a444b4678-r-cc40527f,Depth:21,Reason:slow event handlers blocking the queue (22-Jun-2022 04:07:45.446) (total time: 125ms):
2022-06-22T00:07:45.571701075-04:00 Trace[1073437889]: [125.091157ms] [125.091157ms] END
```
The next question is: why can't the replica controller update these 2 replicas? Do other replicas on the 2 nodes work fine? You can verify it by attaching or detaching volumes that hold replicas on these 2 nodes, e.g. as in the sketch below.
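If you'd rather check these two replicas from the CLI than from the UI, something like the following should show whether their status is being updated at all. This is a hedged sketch: the namespace and field path follow the defaults mentioned above (`longhorn-system`, `status.instanceManagerName`); verify the exact field names on your release with `kubectl describe` or `kubectl explain`.

```bash
# Find the replica CRs that belong to the problematic volume.
kubectl -n longhorn-system get replicas.longhorn.io | grep pvc-ca1cee19

# Show the instance manager name (if any) recorded in the replica status.
kubectl -n longhorn-system get replicas.longhorn.io <replica-name> \
  -o jsonpath='{.status.instanceManagerName}{"\n"}'

# Watch whether the replica status changes at all while you attach/detach the volume.
kubectl -n longhorn-system get replicas.longhorn.io <replica-name> -o yaml --watch
```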
s
The `k3s-e` node is `NotReady` because I cordoned it off after it started causing all sorts of problems, but it was `Ready` for most of the time I was troubleshooting this, so I don't think that's the issue; I rather believe that both of these volumes were having the same problem. Keep in mind that I had 13 other volumes across 5 nodes and they were all showing the same symptoms. Regarding the logs, what are these `slow event handlers` it's referring to? Is this on the Kubernetes side, or on the Longhorn control plane?
f
It's caused by the work queue in the longhorn manager. I will continue checking it tomorrow.
s
I did try attaching the volumes to nodes manually (not these volumes, but others that I more urgently needed to restore), but I saw the same issue: they were stuck on `Attaching` forever. Thank you for looking into this. You can take your time, this is not for a production system. 🙂
f
You can create a brand new volume (for testing purposes) and then attach it to a random node; a minimal sketch is below. If the attachment is pretty fast, then the nodes the volume engine/replicas reside on should be fine. But based on your previous result, it seems that the work queue of the controllers in the longhorn managers is blocked, hence attachment/detachment becomes pretty slow.
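One way to run that test from the CLI is to create a throwaway PVC backed by Longhorn plus a pod that mounts it, then watch how long attachment takes. This is a rough sketch, assuming the default `longhorn` StorageClass name; the PVC/pod names are made up for illustration.

```bash
# Create a small test PVC on the Longhorn StorageClass and a pod that mounts it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-attach-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: longhorn-attach-test
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: longhorn-attach-test
EOF

# Watch the volume state; on a healthy cluster it should reach attached within seconds.
kubectl -n longhorn-system get volumes.longhorn.io -w

# Clean up afterwards.
kubectl delete pod/longhorn-attach-test pvc/longhorn-attach-test
```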
IIRC, Longhorn could show the work queue info via metrics. @famous-journalist-11332 Can you provide the instructions?
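In case it helps while waiting for those instructions: the longhorn-manager exposes Prometheus metrics through the `longhorn-backend` service on port 9500, so you can at least peek at what is exported and look for anything queue-related. Whether work-queue metrics are actually included depends on the Longhorn version, so treat this as a hedged sketch.

```bash
# Port-forward the Longhorn manager metrics endpoint locally.
kubectl -n longhorn-system port-forward svc/longhorn-backend 9500:9500 &

# Dump the exported metrics and look for queue-related series (names vary by version).
curl -s http://localhost:9500/metrics | grep -iE 'workqueue|queue'
```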
BTW, you can first check whether the API server and etcd of your cluster work fine; a sketch is below. If yes, then the blocking is probably caused by Longhorn itself.
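For the API server / etcd check, something like the following is usually enough. Note that on a default k3s install the datastore may be sqlite via kine rather than real etcd, in which case the etcd-specific check simply reflects the backing datastore.

```bash
# Overall API server readiness, broken down per check.
kubectl get --raw='/readyz?verbose'

# Storage backend health (etcd, or kine/sqlite on a default k3s install) as seen by the API server.
kubectl get --raw='/readyz/etcd'

# Rough latency feel: time a simple list call against the API server.
time kubectl -n longhorn-system get pods >/dev/null
```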