# longhorn-storage
l
We need to be careful here to avoid potential data loss, as you're now in a situation where you're down to the last replica - and we don't know if it's healthy.
s
cc @famous-shampoo-18483
f
• If all replicas of a volume become unavailable, please follow this doc to do data recovery first: http://localhost:1313/docs/1.3.0/advanced-resources/data-recovery/
• I guess something may be wrong with volume live upgrade. If the volume is not upgraded, all replicas should still be in the old instance managers, and Longhorn won't clean up old instance managers that still hold engines/replicas (see the sketch below for one way to check this). Can you provide the support bundle?
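A quick way to see which instance managers are still around and where each replica thinks it lives is to query the Longhorn CRDs directly. This is a minimal sketch, assuming the default `longhorn-system` namespace and Longhorn 1.3.x CRD names; adjust for your install.

```bash
# List instance managers (engine and replica managers) and the nodes they run on.
kubectl -n longhorn-system get instancemanagers.longhorn.io

# List the replica custom resources; the NAME column encodes the volume name.
kubectl -n longhorn-system get replicas.longhorn.io

# Inspect one replica in detail to see which node / instance manager it is bound to.
kubectl -n longhorn-system describe replicas.longhorn.io <replica-name>
```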
s
How can I get the support bundle? I ended up resolving this by restoring my most important volumes from backups, but I think I may still have a couple of volumes in this state.
f
On the bottom left of the Longhorn UI page, there is a `Generate Support Bundle` button.
s
Ok, I have the bundle. Do I just put it here?
f
You can send it to Longhorn Support Bundle <longhorn-support-bundle@Suse.com>. And it's better to point out the problematic volume/node names. Thank you!
s
Alright, submitted. Thanks!
f
For volume `pvc-3b98bec7-20a3-4f1d-9ba8-c66ce8237075` (the related PVC is `data-consul-consul-server-0`), the only replica is on the NotReady node `k3s-e`, hence Longhorn won't set `status.instanceManagerName`.

---

For volume `pvc-ca1cee19-a051-483e-9c38-626a444b4678` (the related PVC is `data-consul-consul-server-1`), things are weird… The 2 replicas of this volume are on 2 Ready nodes, `k3s-a` and `k3s-b`, but somehow the instance manager name is not set either. According to the only log in the longhorn manager pod, the reconciliation of the replica may keep failing or blocking, hence the replica controller in the longhorn manager pod cannot update the replica status:
```
2022-06-22T00:07:45.571613465-04:00 I0622 04:07:45.571313       1 trace.go:205] Trace[1073437889]: "DeltaFIFO Pop Process" ID:longhorn-system/pvc-ca1cee19-a051-483e-9c38-626a444b4678-r-cc40527f,Depth:21,Reason:slow event handlers blocking the queue (22-Jun-2022 04:07:45.446) (total time: 125ms):
2022-06-22T00:07:45.571701075-04:00 Trace[1073437889]: [125.091157ms] [125.091157ms] END
```
The next question is: why can't the replica controller update these 2 replicas? Do other replicas on the 2 nodes work fine? You can verify it by attaching or detaching volumes that hold replicas on these 2 nodes, e.g. as in the sketch below.
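If you'd rather check these two replicas from the CLI than from the UI, something like the following should show whether their status is being updated at all. This is a hedged sketch: the namespace and field path follow the defaults mentioned above (`longhorn-system`, `status.instanceManagerName`); verify the exact field names on your release with `kubectl describe` or `kubectl explain`.

```bash
# Find the replica CRs that belong to the problematic volume.
kubectl -n longhorn-system get replicas.longhorn.io | grep pvc-ca1cee19

# Show the instance manager name (if any) recorded in the replica status.
kubectl -n longhorn-system get replicas.longhorn.io <replica-name> \
  -o jsonpath='{.status.instanceManagerName}{"\n"}'

# Watch whether the replica status changes at all while you attach/detach the volume.
kubectl -n longhorn-system get replicas.longhorn.io <replica-name> -o yaml --watch
```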
s
The `k3s-e` node is `NotReady` because I cordoned it off after it started causing all sorts of problems, but it was `Ready` for most of the time I was troubleshooting this, so I don't think that's the issue; I rather believe that both of these volumes were having the same problem. Keep in mind that I had 13 other volumes across 5 nodes and they were all showing the same symptoms. Regarding the logs, what are these `slow event handlers` it's referring to? Is this on the Kubernetes side, or on the Longhorn control plane?
f
It's caused by the work queue in the longhorn manager. I will continue checking it tomorrow.
s
I did try attaching the volumes to nodes manually (not these volumes, but others that I more urgently needed to restore), but I saw the same issue: they were stuck on `Attaching` forever. Thank you for looking into this. You can take your time, this is not for a production system. 🙂
f
You can create a brand new volume (for testing purposes) and then attach it to a random node; a minimal sketch is below. If the attachment is pretty fast, then the nodes the volume engine/replicas reside on should be fine. But based on your previous result, it seems that the work queue of the controllers in the longhorn managers is blocked, hence attachment/detachment becomes pretty slow.
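One way to run that test from the CLI is to create a throwaway PVC backed by Longhorn plus a pod that mounts it, then watch how long attachment takes. This is a rough sketch, assuming the default `longhorn` StorageClass name; the PVC/pod names are made up for illustration.

```bash
# Create a small test PVC on the Longhorn StorageClass and a pod that mounts it.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-attach-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: longhorn-attach-test
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: longhorn-attach-test
EOF

# Watch the volume state; on a healthy cluster it should reach attached within seconds.
kubectl -n longhorn-system get volumes.longhorn.io -w

# Clean up afterwards.
kubectl delete pod/longhorn-attach-test pvc/longhorn-attach-test
```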
IIRC, Longhorn could show the work queue info via metrics. @famous-journalist-11332 Can you provide the instructions?
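In case it helps while waiting for those instructions: the longhorn-manager exposes Prometheus metrics through the `longhorn-backend` service on port 9500, so you can at least peek at what is exported and look for anything queue-related. Whether work-queue metrics are actually included depends on the Longhorn version, so treat this as a hedged sketch.

```bash
# Port-forward the Longhorn manager metrics endpoint locally.
kubectl -n longhorn-system port-forward svc/longhorn-backend 9500:9500 &

# Dump the exported metrics and look for queue-related series (names vary by version).
curl -s http://localhost:9500/metrics | grep -iE 'workqueue|queue'
```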
BTW, you can first check whether the API server and etcd of your cluster work fine; a sketch is below. If yes, then the blocking is probably caused by Longhorn itself.
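For the API server / etcd check, something like the following is usually enough. Note that on a default k3s install the datastore may be sqlite via kine rather than real etcd, in which case the etcd-specific check simply reflects the backing datastore.

```bash
# Overall API server readiness, broken down per check.
kubectl get --raw='/readyz?verbose'

# Storage backend health (etcd, or kine/sqlite on a default k3s install) as seen by the API server.
kubectl get --raw='/readyz/etcd'

# Rough latency feel: time a simple list call against the API server.
time kubectl -n longhorn-system get pods >/dev/null
```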