adamant-kite-43734 (07/08/2024, 8:43 AM):

powerful-librarian-10572 (07/08/2024, 9:19 AM):

sticky-summer-13450 (07/08/2024, 11:48 AM):

faint-sunset-36608 (07/22/2024, 2:43 PM):
W1019 01:11:18.316567 967 volume_path_handler_linux.go:62] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device path: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7"
W1019 01:11:18.316582 967 volume_path_handler.go:217] Warning: Unmap skipped because symlink does not exist on the path: /var/lib/kubelet/pods/19e41dfe-8cee-40c2-a39b-37c68b01c9a7/volumeDevices/kubernetes.io~csi/pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5
E1019 01:11:18.316662 967 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5 podName:19e41dfe-8cee-40c2-a39b-37c68b01c9a7 nodeName:}" failed. No retries permitted until 2023-10-19 01:13:20.316609446 +0000 UTC m=+1883.799156551 (durationBeforeRetry 2m2s). Error: UnmapVolume.UnmapBlockVolume failed for volume "pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5") pod "19e41dfe-8cee-40c2-a39b-37c68b01c9a7" (UID: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7") : blkUtil.DetachFileDevice failed. globalUnmapPath:, podUID: 19e41dfe-8cee-40c2-a39b-37c68b01c9a7, bindMount: true: failed to unmap device from map path. mapPath is empty
The general solution to the problem is to determine why Kubernetes has the Longhorn volume attached to an extra node and try to get it to detach. Directly deleting the Kubernetes VolumeAttachment object will have this effect, but may have unintended consequences (Kubernetes generally cleans up VolumeAttachments in due course once the requirements for doing so have been met). It is usually obvious why the volume is attached to one node (there is a VM workload using it running there). Normally, a second node is only attached during an intentional migration. Otherwise, there may be some unexpected reason (e.g. the one above, or some pod still running on the node that has not fully terminated).
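A minimal sketch of the inspection and (last-resort) cleanup described above; <attachment-name> is a placeholder for whichever csi-... VolumeAttachment points at the extra node:
# See which node each CSI volume is attached to
kubectl get volumeattachments
# Last resort only, given the caveats above: remove the stale attachment directly
kubectl delete volumeattachment <attachment-name>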

sticky-summer-13450 (07/22/2024, 4:16 PM):
harvester001 (I'm good with naming) has the suggested phrase in its kubelet log, /var/lib/rancher/rke2/agent/logs/kubelet.log:
W0614 17:28:55.934787 15084 volume_path_handler_linux.go:62] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device path: "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f/dev/e82317b5-dde8-40f2-8fdc-af0c655a3e5a"
The VM with its root disk in this state is shut down.
I'm slightly apprehensive about the remedy "try to get it to detach". Do I delete the longhorn.io VolumeAttachment named pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f?

faint-sunset-36608 (07/22/2024, 4:30 PM):
eweber@laptop:~/longhorn-manager> kubectl get -n longhorn-system volumeattachments.longhorn.io
NAME                                       AGE
pvc-936694f5-6f7f-4593-90e9-7eb5faedfee1   40s
Deleting these is a POTENTIAL strategy, but not preferred. You probably have two that reference the same PV while the VM is online. With the VM offline, you probably only have one.
eweber@laptop:~/longhorn-manager> kl get volumeattachment
NAME                                                                   ATTACHER             PV                                         NODE                                ATTACHED   AGE
csi-03f4bc0377f22c000f78d1c7396827b402430321e1e467ab620423a0271b8a6e   driver.longhorn.io   pvc-936694f5-6f7f-4593-90e9-7eb5faedfee1   eweber-v126-worker-9c1451b4-kgxdq   true       78s
To be clear, if you are experiencing the issue I was discussing above (with that log message), kubelet is probably still logging the message repeatedly, even now. (The message you provided is pretty old, so I don't know the current status.) If that is the case, the strategy should be to restart the kubelet on the offending node. Deleting the VolumeAttachment object directly will confuse the kubelet, which is actively trying to detach it, but failing due to the upstream Kubernetes bug. It's probably good to confirm that other volumes aren't ALSO stuck migrating from/to the "faulty" node, because restarting the kubelet can cause volumes in this state to crash and require rebuilding. (That particular issue is resolved on the Longhorn side in later versions, but I still think it's better to be safe.)
If the kubelet is NOT still logging repeatedly and the VM is stopped, I don't know off the top of my head why there should be a rogue VolumeAttachment. Feel free to upload a support bundle or send it to longhorn-support-bundle@suse.com and I should be able to take a quick look sometime today to see if there is an obvious reason related to a known issue (or some new issue I'm not aware of).
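A rough sketch of the check and kubelet restart described above, assuming an RKE2-based Harvester node (the exact service unit depends on the node's role):
# On the offending node: confirm kubelet is still emitting the message repeatedly
grep "couldn't find loopback device" /var/lib/rancher/rke2/agent/logs/kubelet.log | tail -n 5
# On RKE2, kubelet runs under the rke2 service, so restarting the service restarts kubelet
systemctl restart rke2-server   # control-plane/management node
systemctl restart rke2-agent    # worker node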

sticky-summer-13450 (07/22/2024, 4:37 PM):

faint-sunset-36608 (07/22/2024, 7:09 PM):
pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f today. Maybe I was just too late in my response to catch the negative effects?

sticky-summer-13450 (07/22/2024, 8:31 PM):

sticky-summer-13450 (07/22/2024, 8:32 PM):

faint-sunset-36608 (07/22/2024, 8:33 PM):

sticky-summer-13450 (07/22/2024, 8:34 PM):

faint-sunset-36608 (07/22/2024, 8:35 PM):

sticky-summer-13450 (07/22/2024, 8:35 PM):

faint-sunset-36608 (07/22/2024, 8:40 PM):
With concurrent-automatic-engine-upgrade-per-node-limit == 3, it may be the case that the rebuild is waiting on the engine upgrade. (It looks like this volume engine is still using v1.4.3.) Hopefully the support bundle indicates why the upgrade is blocked (if that is what's going on).
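If it helps, a hedged way to check that setting and the engine images from the CLI (resource names as used elsewhere in this thread; output columns may vary by Longhorn version):
# Show the concurrent engine-upgrade limit Longhorn is using
kubectl get -n longhorn-system settings.longhorn.io concurrent-automatic-engine-upgrade-per-node-limit
# List this volume's engines; the IMAGE column shows which engine image each one is running
kubectl get -n longhorn-system engines.longhorn.io | grep pvc-86efbb6c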

faint-sunset-36608 (07/22/2024, 9:11 PM):

sticky-summer-13450 (07/22/2024, 9:16 PM):

faint-sunset-36608 (07/22/2024, 9:26 PM):
The existing replicas reference the engine pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f-e-72b58d18, which Longhorn also thinks is the active one. So it seems to me the correct mitigation is to:
1. Turn the VM off again and wait for the volume to detach.
2. Try kubectl delete -n longhorn-system engine pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f-e-f5af50fa. This is NOT the active engine and not the one referenced by the existing replicas.
3. Use kubectl get -n longhorn-system engine pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f-e-f5af50fa to confirm the engine is fully removed.
4. Turn the VM back on again.
I won't be able to get to it today, but I'll try to do an RCA on this situation to understand how it happened. It will be very helpful if you can provide some additional details:
1. What version of Harvester was originally installed?
2. When did you upgrade?
3. What version of Harvester did you upgrade to? Did you go through interim versions?
4. Was the volume usable between the upgrade and now?
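A minimal sketch of the mitigation above, with a hedged way to watch for the detach in step 1 (names are the ones from this thread; printed columns may vary by Longhorn version):
# 1. After shutting the VM down, watch until the volume's STATE shows detached
kubectl get -n longhorn-system volumes.longhorn.io pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f -w
# 2. Delete the stale, non-active engine
kubectl delete -n longhorn-system engine pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f-e-f5af50fa
# 3. Confirm it is fully removed (expect a NotFound error)
kubectl get -n longhorn-system engine pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f-e-f5af50fa
# 4. Start the VM again and confirm the volume re-attaches cleanly
kubectl get -n longhorn-system volumes.longhorn.io pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f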

sticky-summer-13450 (07/22/2024, 9:33 PM):

sticky-summer-13450 (07/23/2024, 7:06 AM):

sticky-summer-13450 (07/23/2024, 7:42 AM):

faint-sunset-36608 (07/23/2024, 3:11 PM):