# longhorn-storage
p
Your volume can have X replicas. By default it's 3, but you may have changed the default value. Check "Update Replica Count" in the volume menu.
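(If you prefer the CLI over the UI, one rough way to read the configured replica count off the volume CR is sketched below; it assumes the standard `volumes.longhorn.io` CRD in the `longhorn-system` namespace, and `<volume-name>` is a placeholder.)
```
# Print the desired replica count recorded on the Longhorn volume CR.
kubectl -n longhorn-system get volumes.longhorn.io <volume-name> \
  -o jsonpath='{.spec.numberOfReplicas}{"\n"}'
```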
s
The issues are that this volume is degraded and, as the error at the top suggests, that it has more than one engine attached. I expect the replicas will return to the expected 3 once the error is fixed.
f
According to Longhorn, the volume is migrating between two nodes. This means that there are two Kubernetes VolumeAttachments referencing the volume in the cluster. The volume will remain degraded until one of the VolumeAttachments is deleted, either confirming the migration or rolling it back. (Replica rebuilds can't happen during migrations.)

Usually, volume migrations complete quickly without issue, but some difficulties have been uncovered in older versions of Longhorn used by older versions of Harvester. In one such issue (caused by an upstream kubelet bug), when a cluster node restarts without being drained, the volume remains attached to the restarted node indefinitely, even though Harvester starts up the VMs that use the volume on a new node. This is known to happen in Harvester v1.2.1 with Longhorn v1.4.3. See https://github.com/harvester/harvester/issues/5048 and https://github.com/harvester/harvester/issues/4633 for details.

The specific solution to that one is to determine which node has kubelet logs like the following related to the volume and restart the kubelet on that node (a rough sketch of how to do that follows the log excerpt below). It is better to do so with the associated VM powered down if possible.
```
W1019 01:11:18.316567     967 volume_path_handler_linux.go:62] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device path: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7"
W1019 01:11:18.316582     967 volume_path_handler.go:217] Warning: Unmap skipped because symlink does not exist on the path: /var/lib/kubelet/pods/19e41dfe-8cee-40c2-a39b-37c68b01c9a7/volumeDevices/kubernetes.io~csi/pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5
E1019 01:11:18.316662     967 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5 podName:19e41dfe-8cee-40c2-a39b-37c68b01c9a7 nodeName:}" failed. No retries permitted until 2023-10-19 01:13:20.316609446 +0000 UTC m=+1883.799156551 (durationBeforeRetry 2m2s). Error: UnmapVolume.UnmapBlockVolume failed for volume "pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5") pod "19e41dfe-8cee-40c2-a39b-37c68b01c9a7" (UID: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7") : blkUtil.DetachFileDevice failed. globalUnmapPath:, podUID: 19e41dfe-8cee-40c2-a39b-37c68b01c9a7, bindMount: true: failed to unmap device from map path. mapPath is empty
```
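(As a rough sketch of the "find the node and restart its kubelet" step: the kubelet log path below matches the RKE2 layout used by Harvester, and the PVC name is the one from the example log above. The restart commands are assumptions; on Harvester/RKE2 nodes the kubelet is run by the rke2-server or rke2-agent service depending on the node's role, so check which unit exists before restarting anything.)
```
# On each node, check whether the kubelet is still repeatedly logging the
# failed-unmap message for the affected volume.
grep "UnmapVolume.UnmapBlockVolume failed" /var/lib/rancher/rke2/agent/logs/kubelet.log \
  | grep "pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5" | tail -n 5

# On the offending node, restart the service that manages the kubelet.
systemctl status rke2-server.service rke2-agent.service   # see which unit is present
systemctl restart rke2-server.service                     # or rke2-agent.service on worker nodes
```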
The general solution to the problem is to determine why Kubernetes has the Longhorn volume attached to an extra node and try to get it to detach. Directly deleting the Kubernetes VolumeAttachment object will have this effect, but it may have unintended consequences (Kubernetes generally cleans up VolumeAttachments on its own once the requirements for doing so have been met). It is usually obvious why the volume is attached to one node: there is a VM workload using it running there. Normally, the volume is only attached to a second node during an intentional migration. Otherwise, there may be some unexpected reason (e.g. the issue above, or some pod on the node that has not fully terminated).
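(As a rough sketch of how to see which nodes Kubernetes currently has a volume attached to: list the VolumeAttachment objects and match on the PV name. `<pv-name>` is a placeholder for the volume's pvc-... name.)
```
# Show every Kubernetes VolumeAttachment with the PV and node it references.
kubectl get volumeattachments.storage.k8s.io \
  -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached

# Narrow the output to the volume in question.
kubectl get volumeattachments.storage.k8s.io \
  -o custom-columns=PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName --no-headers \
  | grep <pv-name>
```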
s
Hi @faint-sunset-36608 - thanks for your assistance. The cluster was running Harvester v1.2.1 for a short while, before I could upgrade to v1.3.1. I have found that the node `harvester001` (I'm good with naming) has the suggested phrase in its kubelet log `/var/lib/rancher/rke2/agent/logs/kubelet.log`:
```
W0614 17:28:55.934787   15084 volume_path_handler_linux.go:62] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device path: "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f/dev/e82317b5-dde8-40f2-8fdc-af0c655a3e5a"
```
The VM with its root disk in this state is shut down. I'm slightly apprehensive about the remedy "try to get it to detach". Do I delete the `longhorn.io` VolumeAttachment named `pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f`?
f
No, please do not delete that. It is an internal Longhorn implementation detail and probably not at all the culprit here. I am referring to the Kubernetes VolumeAttachment objects, not the Longhorn-internal ones. We definitely do NOT want to delete these internal Longhorn VolumeAttachments:
```
eweber@laptop:~/longhorn-manager> kubectl get -n longhorn-system volumeattachments.longhorn.io
NAME                                       AGE
pvc-936694f5-6f7f-4593-90e9-7eb5faedfee1   40s
```
Deleting the Kubernetes VolumeAttachments, on the other hand, is a POTENTIAL strategy, but not the preferred one. You probably have two that reference the same PV while the VM is online; with the VM offline, you probably only have one:
```
eweber@laptop:~/longhorn-manager> kl get volumeattachment
NAME                                                                   ATTACHER             PV                                         NODE                                ATTACHED   AGE
csi-03f4bc0377f22c000f78d1c7396827b402430321e1e467ab620423a0271b8a6e   driver.longhorn.io   pvc-936694f5-6f7f-4593-90e9-7eb5faedfee1   eweber-v126-worker-9c1451b4-kgxdq   true       78s
```
To be clear, if you are experiencing the issue I was discussing above (with that log message), the kubelet is probably still logging the message repeatedly, even now. (The message you provided is pretty old, so I don't know the current status.) If that is the case, the strategy should be to restart the kubelet on the offending node. Deleting the VolumeAttachment object directly will confuse the kubelet, which is actively trying to detach it but failing due to the upstream Kubernetes bug.

It's probably good to confirm that other volumes aren't ALSO stuck migrating from/to the "faulty" node, because restarting the kubelet can cause volumes in that state to crash and require rebuilding. (That particular issue is resolved on the Longhorn side in later versions, but I still think it's better to be safe.)

If the kubelet is NOT still logging repeatedly and the VM is stopped, I don't know off the top of my head why there should be a rogue VolumeAttachment. Feel free to upload a support bundle or send it to longhorn-support-bundle@suse.com and I should be able to take a quick look sometime today to see if there is an obvious reason related to a known issue (or some new issue I'm not aware of).
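(One rough way to check both of those things before restarting anything: confirm the unmap failure is still being logged on the suspect node, and look for any PV that Kubernetes has attached to more than one node. The paths and commands below are a sketch, not an exact recipe.)
```
# On the suspect node: is the unmap failure still appearing in recent kubelet logs?
tail -n 1000 /var/lib/rancher/rke2/agent/logs/kubelet.log \
  | grep "UnmapVolume.UnmapBlockVolume failed" | tail -n 3

# From any host with kubectl: which PVs are attached to more than one node?
kubectl get volumeattachments.storage.k8s.io \
  -o custom-columns=PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName --no-headers \
  | awk '{count[$1]++} END {for (pv in count) if (count[pv] > 1) print pv, count[pv]}'
```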
s
Thanks. I'm glad I checked. The message is only logged once in one kubelet log. It is not continuing to be logged. I'll make a support bundle. Thank you.
f
Are you still experiencing the issue? I see some things in there from back on July 6 - July 8, but no evidence of any rogue VolumeAttachments for `pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f` today. Maybe I was just too late in my response to catch the negative effects?
s
When the VM is turned off the volume shows as detached with a spinny and this upgrade message.
When the VM is running the volume shows as degraded.
f
Is there an ongoing rebuild?
s
No - no rebuilding. The volume detail is still like the first image in this thread...
f
This looks quite strange. Can you capture a support bundle in this attached state?
s
sure
f
Since your `concurrent-automatic-engine-upgrade-per-node-limit` is set to 3, it may be the case that the rebuild is waiting on the engine upgrade. (It looks like this volume's engine is still using `v1.4.3`.) Hopefully the support bundle indicates why the upgrade is blocked (if that is what's going on).
Do you have any other volumes acting like this?
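(For reference, a rough way to check that setting and the engine image from the CLI; this assumes the standard Longhorn CRDs in the `longhorn-system` namespace, and `<engine-name>` is a placeholder for the volume's engine CR.)
```
# Show the current value of the automatic engine upgrade limit.
kubectl -n longhorn-system get settings.longhorn.io \
  concurrent-automatic-engine-upgrade-per-node-limit

# List the engine CRs, then check which image a particular engine is running.
kubectl -n longhorn-system get engines.longhorn.io
kubectl -n longhorn-system get engines.longhorn.io <engine-name> -o yaml | grep -i image
```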
s
No. Just that volume.
f
I cannot understand why yet, but the immediate problem seems to be that Longhorn is tracking two engine CRs for the same volume. One is active, one is not. Normally, it should clean up the non-active one, but it appears unable to for now. I will need to do a deep dive in the code to try to understand why. My immediate guess is that a bad interaction between migration and upgrade caused this, similar to https://github.com/longhorn/longhorn/issues/7833, but with a different outcome.

The two existing replicas both reference `pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f-e-72b58d18`, which Longhorn also thinks is the active engine. So it seems to me the correct mitigation is to:
1. Turn the VM off again and wait for the volume to detach.
2. Try `kubectl delete -n longhorn-system engine pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f-e-f5af50fa`. This is NOT the active engine and not the one referenced by the existing replicas.
3. Use `kubectl get -n longhorn-system engine pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f-e-f5af50fa` to confirm the engine is fully removed.
4. Turn the VM back on again.
(See the sketch after this message for one way to double-check which engine is the active one before deleting anything.)

I won't be able to get to it today, but I'll try to do an RCA on this situation to understand how it happened. It will be very helpful if you can provide some additional details:
1. What version of Harvester was originally installed?
2. When did you upgrade?
3. What version of Harvester did you upgrade to? Did you go through interim versions?
4. Was the volume usable between the upgrade and now?
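(Before step 2, a rough way to double-check which engine CR is the active one and which one the replicas reference. The `longhornvolume` label and the `spec.active` / `spec.engineName` fields are assumptions that may differ between Longhorn versions; fall back to `-o yaml` if the columns come back empty.)
```
# Show both engine CRs for the volume and whether each is marked active.
kubectl -n longhorn-system get engines.longhorn.io \
  -l longhornvolume=pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f \
  -o custom-columns=NAME:.metadata.name,ACTIVE:.spec.active,NODE:.spec.nodeID

# Cross-check which engine the replicas point at.
kubectl -n longhorn-system get replicas.longhorn.io \
  -l longhornvolume=pvc-86efbb6c-de2f-4a25-b2b6-603a3f2fa45f \
  -o custom-columns=NAME:.metadata.name,ENGINE:.spec.engineName
```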
s
I'll give full answers to your questions tomorrow, UK time. This Harvester cluster has been through most versions; it's over 912 days old.
👍 1
Answers:
1. Although I ran v0.1.0, v0.2.0 & v0.3.0, I think none of them were upgradable to v1, so I believe this cluster has been: v1.0.0 → v1.0.1 → v1.0.2 → v1.0.3 → v1.1.1 → v1.1.2 → v1.2.0 → v1.2.1 → v1.2.2 → v1.3.1 (according to the data in harvester.io/Upgrade)
2. Upgrades:
   a. v1.2.2 → v1.3.1 2024-07-06T124646Z
   b. v1.2.1 → v1.2.2 2024-06-14T200954Z
   c. v1.2.0 → v1.2.1 2024-06-14T142709Z
   d. v1.1.2 → v1.2.0 2023-10-02T073121Z
   e. v1.1.1 → v1.1.2 2023-04-27T175601Z
   f. v1.0.3 → v1.1.1 2022-12-04T104921Z
   g. v1.0.2 → v1.0.3 2022-09-02T180451Z
   h. v1.0.1 → v1.0.2 2022-05-29T180057Z
   i. v1.0.0 → v1.0.1 2022-05-29T103908Z
3. I think all the version details are above.
4. I only noticed the volume was in this state soon after the upgrade to v1.3.1, though it's possible it was in this state before - I don't look at the Longhorn dashboard very often, not since upgrading to v1.2.0, when the Longhorn in Harvester became a lot more stable than in the v1.1 series of Harvester releases.
I followed your instructions, and after turning the VM back on Longhorn did what it needed to: it created the missing replica, and the volume is now healthy. Thank you so much for your help and advice. I hope my support bundles and information help make Longhorn even better 🙂
f
Thanks for the information and the interactive troubleshooting! I'll analyze the support bundle more and see if I can find anything interesting. Glad to hear the situation is resolved!
🙌 1