
sticky-summer-13450

10/30/2022, 2:24 PM
What does it mean when a volume has two instance managers listed, errors saying there are two instance managers, and the volume is healthy and not ready, and attached when the Harvester VM is not running? It seems to be a Schrödinger's volume... I think it needs to be detached with no instance managers, so that it can be attached again when Harvester commands it to be.

famous-journalist-11332

10/31/2022, 10:12 PM
Can you send us the support bundle?

famous-journalist-11332

11/02/2022, 2:01 AM
How long has it been stuck in this state? @sticky-summer-13450
Todo: Check if there is a volume crash

sticky-summer-13450

11/02/2022, 7:57 AM
This one has been in this state since Saturday or Sunday - so around 4 days.

famous-journalist-11332

11/03/2022, 11:41 PM
Thank you. Looks like this is a buggy state in which 2 engines are active at the same time. cc @kind-alarm-73406 @famous-shampoo-18483 Ref: https://github.com/longhorn/longhorn-manager/blob/259cf16fb34ae554c2703df2f2a9e7d08fbd2e33/controller/volume_controller.go#L518
2022-11-01T07:45:24.117994945Z time="2022-11-01T07:45:24Z" level=warning msg="Error syncing Longhorn volume longhorn-system/pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8" controller=longhorn-volume error="failed to sync longhorn-system/pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8: failed to reconcile engine/replica state for pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8: BUG: found the second active engine pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-3c989560 besides pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-2a27c92a" node=harvester003
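To confirm that state, something like the following (just a sketch using the two engine names from the error above; any output format that exposes spec.active would do) should show both engine objects with spec.active set to true:
kubectl get engines.longhorn.io -n longhorn-system pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-2a27c92a pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-3c989560 -o custom-columns=NAME:.metadata.name,ACTIVE:.spec.active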
@sticky-summer-13450 You can get out of this buggy state by running
kubectl edit engines.longhorn.io pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-3c989560 -n longhorn-system
and setting spec.active to false.
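A non-interactive equivalent, if you prefer to patch rather than open an editor (a sketch of the same change, not a different fix), would be:
kubectl patch engines.longhorn.io pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-3c989560 -n longhorn-system --type merge -p '{"spec":{"active":false}}'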

kind-alarm-73406

11/03/2022, 11:42 PM
is there a v.spec.migrationID set?

famous-journalist-11332

11/03/2022, 11:43 PM
No, it is not set
spec:
    size: 10737418240
    frontend: blockdev
    fromBackup: ""
    dataSource: ""
    dataLocality: disabled
    staleReplicaTimeout: 30
    nodeID: ""
    migrationNodeID: ""
    engineImage: longhornio/longhorn-engine:v1.3.2
    backingImage: ""
    standby: false
    diskSelector: []
    nodeSelector: []
    disableFrontend: false
    revisionCounterDisabled: false
    lastAttachedBy: ""
    accessMode: rwx
    migratable: true
    encrypted: false
    numberOfReplicas: 3
    replicaAutoBalance: ignored
    baseImage: ""
    recurringJobs: []
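For just that field, a quick check (a sketch, assuming the Longhorn volume object carries the same pvc-... name shown above) is:
kubectl get volumes.longhorn.io pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8 -n longhorn-system -o jsonpath='{.spec.migrationNodeID}'
which prints nothing here because migrationNodeID is empty.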

kind-alarm-73406

11/03/2022, 11:43 PM
I am guessing there was a live migration of a VM and the node might have crashed in the middle of it?
Thanks for the above, looks like the volume should detach and all engines should turn off 🙂

famous-journalist-11332

11/03/2022, 11:45 PM
Somehow there are 2 engines active at the same time and the volume controller is not happy about that. Not sure how we can guarantee an atomic flip of the engines' activeness in the first place.

famous-shampoo-18483

11/04/2022, 2:04 AM
I just found a similar issue ticket: https://github.com/longhorn/longhorn/issues/1755
👍 1

sticky-summer-13450

11/04/2022, 9:17 AM
You can get out of this buggy state by running
kubectl edit engines.longhorn.io pvc-8e5a70d5-a853-4082-b6f6-0ef2cd3cd8c8-e-3c989560 -n longhorn-system
and setting spec.active to false.
Well, applying that edit left that volume with zero replicas - so I guess I should have just blown away this broken Harvester VM days ago and started again. Never mind - it was only a k3s node, I can rebuild it. I hope the issue sees some traction and gets fixed 🙂