stocky-beard-10620
06/20/2022, 6:50 PM
Stopped, Failed, or Unknown. If I inspect a given volume, it tells me that the only replica is "running" on an instance-manager with a name that doesn't match any of the currently running instance-manager pods (neither the -e nor the -r ones). Is it possible that during the upgrade, the connection between replicas and instance-managers got lost or out of sync, and now it has to be fixed manually?

late-needle-80860
06/20/2022, 6:55 PM

salmon-doctor-9726
06/21/2022, 1:32 AM

famous-shampoo-18483
06/21/2022, 2:30 AM

stocky-beard-10620
06/21/2022, 3:19 PM

famous-shampoo-18483
06/22/2022, 3:52 AM
Generate Support Bundle

stocky-beard-10620
06/22/2022, 4:39 AM

famous-shampoo-18483
06/22/2022, 4:43 AM

stocky-beard-10620
06/22/2022, 3:28 PM

famous-shampoo-18483
06/23/2022, 12:38 PM
pvc-3b98bec7-20a3-4f1d-9ba8-c66ce8237075 (the related PVC is data-consul-consul-server-0): the only replica is on the NotReady node k3s-e, hence Longhorn won't set status.instanceManagerName.
---
For volume pvc-ca1cee19-a051-483e-9c38-626a444b4678 (the related PVC is data-consul-consul-server-1), things are weird… The 2 replicas of this volume are on 2 Ready nodes, k3s-a and k3s-b, but somehow the instance manager name is not set either.
According to the only log in the longhorn-manager pod, the reconciliation of the replica may keep failing or blocking, hence the replica controller in the longhorn-manager pod cannot update the replica status:
2022-06-22T00:07:45.571613465-04:00 I0622 04:07:45.571313 1 trace.go:205] Trace[1073437889]: "DeltaFIFO Pop Process" ID:longhorn-system/pvc-ca1cee19-a051-483e-9c38-626a444b4678-r-cc40527f,Depth:21,Reason:slow event handlers blocking the queue (22-Jun-2022 04:07:45.446) (total time: 125ms):
2022-06-22T00:07:45.571701075-04:00 Trace[1073437889]: [125.091157ms] [125.091157ms] END
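
For context on that trace: it is printed by the Kubernetes client library (k8s.io/client-go/tools/cache) running inside the longhorn-manager process, roughly when more than ten deltas have piled up in an informer's DeltaFIFO and handling the popped item takes longer than about 100ms. The "event handlers" it refers to are the callbacks the controllers register on their informers inside that same process, so it points at the Longhorn control plane lagging on watch events rather than at the API server. The skeleton below is a generic client-go illustration of that pattern, not Longhorn's actual code.
```go
// Generic client-go controller skeleton, illustrating the kind of "event
// handlers" the trace above refers to. If event processing in the controller
// process falls behind, deltas pile up in the informer's DeltaFIFO and
// client-go prints that trace. The usual pattern: handlers only enqueue a
// key, and the real (possibly slow) reconciliation runs in a worker loop.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	// Keep handlers trivial: enqueue a key and return immediately.
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)

	// The slow reconciliation work happens here, off the informer's path.
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		fmt.Println("reconcile", key) // placeholder for the real sync logic
		queue.Done(key)
	}
}
```
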
stocky-beard-10620
06/23/2022, 12:44 PM
The k3s-e node is NotReady because I cordoned it off; it was causing all sorts of problems. But it was Ready for most of the time I was troubleshooting this, so I don't think that's the issue, and I rather believe that both of these volumes were having the same problem. Keep in mind that I had 13 other volumes across 5 nodes, and they all had the same symptoms.
Regarding the logs, what are these "slow event handlers" it's referring to? Is this on the Kubernetes side or in the Longhorn control plane?

famous-shampoo-18483
06/23/2022, 12:45 PM

stocky-beard-10620
06/23/2022, 12:45 PM
Attaching forever.
Thank you for looking into this. You can take your time, this is not for a production system. 🙂
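
While that is being looked at, a small sketch for keeping an eye on the stuck volumes: it watches the Longhorn Volume CRs and prints every state change, so it is easy to see whether a volume ever leaves Attaching. The status.state / status.robustness field names and the API version are assumptions; adjust them to what the cluster serves.
```go
// Watch Longhorn Volume CRs and print state changes.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn := dynamic.NewForConfigOrDie(cfg)

	// API version is an assumption; check which version the cluster serves.
	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta1", Resource: "volumes"}
	w, err := dyn.Resource(gvr).Namespace("longhorn-system").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		vol, ok := ev.Object.(*unstructured.Unstructured)
		if !ok {
			continue
		}
		state, _, _ := unstructured.NestedString(vol.Object, "status", "state")           // assumed field
		robustness, _, _ := unstructured.NestedString(vol.Object, "status", "robustness") // assumed field
		fmt.Printf("%s %s state=%s robustness=%s\n", ev.Type, vol.GetName(), state, robustness)
	}
}
```
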

famous-shampoo-18483
06/23/2022, 12:48 PM