# longhorn-storage
f
Can you send us the support bundle?
f
How long has it been stuck in this state? @sticky-summer-13450
Todo: Check if there is a volume crash
s
This one has been in this state since Saturday or Sunday - so around 4 days.
f
Thank you. Looks like this is a buggy state in which 2 engines are active at the same time. cc @kind-alarm-73406 @famous-shampoo-18483 Ref: https://github.com/longhorn/longhorn-manager/blob/259cf16fb34ae554c2703df2f2a9e7d08fbd2e33/controller/volume_controller.go#L518
2022-11-01T07:45:24.117994945Z time="2022-11-01T07:45:24Z" level=warning msg="Error syncing Longhorn volume longhorn-system/pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8" controller=longhorn-volume error="failed to sync longhorn-system/pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8: failed to reconcile engine/replica state for pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8: BUG: found the second active engine pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-3c989560 besides pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-2a27c92a" node=harvester003
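For reference, one way to list the engine objects for this volume and see which ones are marked active (a sketch; the longhornvolume label selector is an assumption about how longhorn-manager labels engine CRs):
```
# List engine CRs for this volume and show their spec.active flag.
kubectl -n longhorn-system get engines.longhorn.io \
  -l longhornvolume=pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8 \
  -o custom-columns=NAME:.metadata.name,ACTIVE:.spec.active
```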
@sticky-summer-13450 You can get out of this buggy state by
kubectl edit engines.longhorn.io pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-3c989560 -n longhorn-system
and set spec.active to false
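A non-interactive way to apply the same change, if scripting it is easier (a sketch using kubectl patch on the same engine object):
```
# Mark the extra engine as inactive without opening an editor.
kubectl -n longhorn-system patch engines.longhorn.io \
  pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-3c989560 \
  --type=merge -p '{"spec":{"active":false}}'
```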
k
is there a v.spec.migrationID set?
f
No, it is not set
spec:
    size: 10737418240
    frontend: blockdev
    frombackup: ""
    datasource: ""
    datalocality: disabled
    stalereplicatimeout: 30
    nodeid: ""
    migrationnodeid: ""
    engineimage: longhornio/longhorn-engine:v1.3.2
    backingimage: ""
    standby: false
    diskselector: []
    nodeselector: []
    disablefrontend: false
    revisioncounterdisabled: false
    lastattachedby: ""
    accessmode: rwx
    migratable: true
    encrypted: false
    numberofreplicas: 3
    replicaautobalance: ignored
    baseimage: ""
    recurringjobs: []
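To query just that field instead of reading the whole spec, something like this should work (a sketch; it assumes the field is exposed as spec.migrationNodeID on the volume CR, matching the longhorn-manager types):
```
# Print the volume's migration node ID (empty output means it is not set).
kubectl -n longhorn-system get volumes.longhorn.io \
  pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8 \
  -o jsonpath='{.spec.migrationNodeID}{"\n"}'
```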
k
I am guessing there was a live migration of a VM and the node might have crashed in the middle of it?
Thanks for the above, looks like the volume should detach and all engines should turn off 🙂
f
Somehow there are 2 engines active at the same time and the volume controller is not happy about that. Not sure how we can guarantee an atomic flip of the engines' active flags in the first place.
f
I just found a similar issue ticket: https://github.com/longhorn/longhorn/issues/1755
s
You can get out of this buggy state by
kubectl edit engines.longhorn.io pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-3c989560 -n longhorn-system
and set spec.active to false
Well, applying that edit left that volume with zero replicas - so I guess I should have just blown away this broken Harvester VM days ago and started again. Never mind - it was only a k3s node, I can rebuild it. I hope the issue sees some traction and gets fixed 🙂
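For anyone landing here later: before (or after) flipping spec.active, it may be worth checking how many replica objects the volume still has (a sketch; the longhornvolume label selector is again an assumption):
```
# List replica CRs belonging to this volume; no rows means no replicas remain.
kubectl -n longhorn-system get replicas.longhorn.io \
  -l longhornvolume=pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8
```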