Hello! I noticed that my backups aren't going through anymore (Rancher Users, #harvester)

powerful-easter-15334

05/09/2025, 7:23 AM

Hello! I noticed that my backups aren't going through anymore, with error:

VolumeSnapshot svmb-<...>-<server name>disk-1 in error state

I found this: https://github.com/harvester/harvester/issues/4361, which is closed with "bucket configuration". And then this: https://github.com/harvester/harvester/issues/5841, which is still open but doesn't have a fix. Running 1.4.0 (which I should update, but I don't want to update while my VMs are still having issues being backed up).

powerful-easter-15334

05/09/2025, 7:25 AM

Yesterday I had that error on some 5 VMs. As suggested in the second issue, I restarted the csi-snapshotter deployment

kubectl -n longhorn-system rollout restart deployment csi-snapshotter

and it sort of worked. Overnight, 3 VMs got backed up properly, while 2 still have that error.
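
The restart command above returns as soon as the rollout is triggered, so it can help to wait for the new pod to become ready before retrying a backup. A minimal sketch, assuming the same namespace and deployment name as in the message; the `KUBECTL` override and the 120-second timeout are illustrative choices that do not come from the thread:

```shell
# Restart the CSI snapshotter and block until the rollout has finished.
# KUBECTL is a hypothetical override hook; it defaults to plain kubectl.
restart_csi_snapshotter() {
  kubectl_bin="${KUBECTL:-kubectl}"
  "$kubectl_bin" -n longhorn-system rollout restart deployment csi-snapshotter &&
    "$kubectl_bin" -n longhorn-system rollout status deployment csi-snapshotter --timeout=120s
}
```

Calling `restart_csi_snapshotter` returns a non-zero status if the rollout does not complete within the timeout, which makes it easier to tell a slow restart from a failed one.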

powerful-easter-15334

05/09/2025, 7:29 AM

However, I can create a snapshot perfectly fine. These are all Windows VMs. Admittedly, I don't have any Linux VMs being backed up. I started one up, for testing purposes.

powerful-easter-15334

05/09/2025, 7:45 PM

Backing up a Linux VM went through fine. Though that doesn't say much as some Windows VMs pass through and a couple don't -_-

powerful-easter-15334

05/09/2025, 7:46 PM

I don't know if it's related but I'm scaling up the replicas on Longhorn and while one volume is able to rebuild the extra replica fine, another volume's rebuild is stuck on 0%

powerful-easter-15334

05/09/2025, 7:51 PM

~~generate a support bundle - gets stuck at 16% and times out ._.~~ i stand corrected, it was just taking its time

powerful-easter-15334

05/09/2025, 7:58 PM

support bundle:

supportbundle_afbe44ef-9329-41ce-85ea-3951ec79ce18_2025-05-09T19-49-26Z.zip

powerful-easter-15334

05/09/2025, 8:01 PM

I don't know if I have two problems or both are related. On the one hand, VolumeSnapshot is in "error state" (all replicas are healthy). On the other hand, rebuilding replicas seems either very slow or very much not working.

powerful-easter-15334

05/09/2025, 8:25 PM

And if the suggestion is to restart a longhorn deployment of sorts, I restarted every deployment in the longhorn-system namespace and I still get the same error. Maybe I should move on to the harvester-system ns

magnificent-pencil-261

05/13/2025, 7:25 AM

Which CRs are related to this? I mean the corresponding resource names for VMBackup, VolumeSnapshot, VolumeSnapshotContent, Longhorn Snapshot, and Longhorn Backup CRs.
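
The resource kinds named in this question can be listed one after another to trace a single backup through the chain. A sketch, assuming the Longhorn CRs are exposed as `snapshots.longhorn.io` and `backups.longhorn.io` in the `longhorn-system` namespace (worth confirming with `kubectl api-resources` on the cluster); the function name and the `KUBECTL` override are hypothetical:

```shell
# Print each resource kind in the backup chain for one VM namespace.
# Usage: list_backup_chain <vm-namespace>
list_backup_chain() {
  ns="$1"
  kubectl_bin="${KUBECTL:-kubectl}"
  # Harvester VM backup and the CSI snapshot objects it creates (namespaced).
  "$kubectl_bin" get virtualmachinebackups.harvesterhci.io -n "$ns"
  "$kubectl_bin" get volumesnapshots.snapshot.storage.k8s.io -n "$ns"
  # VolumeSnapshotContent is cluster-scoped, so no namespace flag.
  "$kubectl_bin" get volumesnapshotcontents.snapshot.storage.k8s.io
  # Longhorn's own snapshot and backup CRs (assumed to live in longhorn-system).
  "$kubectl_bin" get snapshots.longhorn.io -n longhorn-system
  "$kubectl_bin" get backups.longhorn.io -n longhorn-system
}
```

The snapshot and snapcontent IDs that appear in error messages can then be matched against these listings with `grep`.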

powerful-easter-15334

05/13/2025, 7:31 AM

Oh, you gave me a good place to look.

I ran `k describe virtualmachinebackups.harvesterhci.io` on one of the affected backups I attempted manually. I found:

Volume Backups:
    Creation Time:    1970-01-01T00:00:00Z
    Csi Driver Name:  driver.longhorn.io
    Error:
      Message:             Failed to check and update snapshot content: failed to take snapshot of the volume pvc-5bd3aeca-91da-4426-8a31-d10268cb195e: "rpc error: code = Internal desc = waitForSnapshotToBeReady: timeout while waiting for snapshot snapshot-37ba210a-857b-4d39-8956-12b2cf127c7d to be ready"

powerful-easter-15334

05/13/2025, 7:34 AM

However, in the Longhorn UI, it shows that the snapshot was successful

magnificent-pencil-261

05/13/2025, 7:36 AM

Maybe the LH snapshot takes longer than the CSI timeout because the related volume is busy, e.g. with heavy I/O or rebuilding?

magnificent-pencil-261

05/13/2025, 7:36 AM

Can you check the related volumesnapshot?

powerful-easter-15334

05/13/2025, 7:37 AM

That kinda makes sense, yeah. I am running Harvester on servers with hard disks and not SSD storage, so I would understand if I/O and access speeds were a limiting factor. Is there maybe a way to increase the timeout, say to a second?

magnificent-pencil-261

05/13/2025, 7:38 AM

From my understanding, the CSI timeout can't be changed at runtime; it's something internal to the LH implementation.

powerful-easter-15334

05/13/2025, 7:41 AM

harvester-2:/home/rancher # k get volumesnapshots.snapshot.storage.k8s.io -n adbookings-servers | grep a449235d-a597-4970-b56c-cbbdf7070e07
svmb-<long string>-volume-adbookings-application-server-disk-2-i2bxf   false        adbookings-application-server-disk-2-i2bxf                                         longhorn            snapcontent-a449235d-a597-4970-b56c-cbbdf7070e07                  32h

So the volumesnapshot is not ready to use
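
Rather than grepping for one snapcontent ID at a time, the ready flag can be checked for every VolumeSnapshot in one pass. A minimal sketch, assuming `.status.readyToUse` is the field behind the `false` column shown above; the function name is hypothetical:

```shell
# Read "namespace name readyToUse" rows on stdin and print every snapshot
# whose third column is anything other than "true", as namespace/name.
not_ready_snapshots() {
  awk '$3 != "true" { print $1 "/" $2 }'
}

# Example invocation against a cluster (not run here):
# kubectl get volumesnapshots.snapshot.storage.k8s.io -A --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,READY:.status.readyToUse \
#   | not_ready_snapshots
```

Rows where the status field is missing are printed by `kubectl` as `<none>`, so they are reported as not ready too.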

powerful-easter-15334

05/13/2025, 7:43 AM

Oh, I accidentally took it from a different VM than the one I got the virtualmachinebackups from. Either way, both have the same error.

magnificent-pencil-261

05/13/2025, 7:47 AM

If you’ve confirmed that the Longhorn snapshot associated with the VolumeBackup has been successfully created, you can try performing a rollout restart of the CSI snapshotter.

powerful-easter-15334

05/13/2025, 7:51 AM

I just recreated a backup to be sure because right now nothing is happening on the cluster; no rebuilding or anything which could cause a timeout. The snapshot for disk-2 hasn't appeared in the Longhorn UI. However, the progress keeps increasing despite the error. If I check in every so often, the percentage has gone up.

powerful-easter-15334

05/13/2025, 7:51 AM

The snapshot for disk-1 does appear in the Longhorn UI and does take a little bit of time, around 3 minutes. Eventually it does go through though.

magnificent-pencil-261

05/13/2025, 7:54 AM

If you’ve confirmed that the Longhorn snapshot associated with the VolumeBackup has been successfully created, you can try performing a rollout restart of the CSI snapshotter.

Did you try this? And can you generate the latest support bundle (SB)?

powerful-easter-15334

05/13/2025, 7:57 AM

The snapshot resource exists for disk-2 (the one in error state) but is not ready and doesn't even appear in the Longhorn UI. So I wouldn't call it "successfully created" but I'll restart the CSI snapshotter deployment and run a new backup, just in case

powerful-easter-15334

05/13/2025, 7:57 AM

The SB is generating right now

powerful-easter-15334

05/13/2025, 8:05 AM

Affected VM: adbookings-application-server
Namespace: adbookings-servers
Backup: test-backup-adbookings-app

powerful-easter-15334

05/13/2025, 8:06 AM

supportbundle_afbe44ef-9329-41ce-85ea-3951ec79ce18_2025-05-13T07-54-59Z.zip

powerful-easter-15334

05/13/2025, 8:18 AM

Restarted the csi snapshotter deployment - and now both disk-1 and disk-2 have the same issue.

Message:     Failed to create snapshot: failed to take snapshot of the volume pvc-2fdf0972-11ca-4564-8e05-3bee84c20dbf: "rpc error: code = DeadlineExceeded desc = waitForBackupControllerSync: timeout while waiting for backup controller to sync for volume pvc-2fdf0972-11ca-4564-8e05-3bee84c20dbf and snapshot snapshot-c3b1d58f-a779-4ac4-b3a9-78161fc8b0b0"
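
The two errors quoted in this thread time out in different steps: the earlier one in `waitForSnapshotToBeReady` (gRPC code `Internal`), this one in `waitForBackupControllerSync` (`DeadlineExceeded`). When comparing many failed backups, pulling out just the step name makes the difference easier to see. A small sketch; the function name is hypothetical and the pattern only covers messages shaped like the two above:

```shell
# Read error messages on stdin and print the wait step that timed out,
# one per matching line. Lines without a "waitFor...: timeout" are skipped.
timeout_step() {
  sed -n 's/.*\(waitFor[A-Za-z]*\): timeout.*/\1/p'
}
```

For example, the output of `k describe virtualmachinebackups.harvesterhci.io <name> -n <namespace>` can be piped into `timeout_step`.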

powerful-easter-15334

05/13/2025, 8:20 AM

svmb-2386ca1f-87e9-4518-bbdb-8a8605bc7f42-20250508.2320-volume-adbookings-application-server-disk-2-i2bxf   true         adbookings-application-server-disk-2-i2bxf                           2Mi           longhorn            snapcontent-03c283d7-bf9c-4e9b-80b8-af7e587ee38d   55y            4d8h

Not sure how related, but in a backup I did (which worked) the age of the volume is shown as... 55 years? This also shows up in a failed volumesnapshot.
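
In the row above, `55y` sits in the CREATIONTIME column and `4d8h` is the actual AGE. One possible reading, which is only an assumption, is that the snapshot's creation timestamp is unset and is being rendered as the Unix epoch, the same `1970-01-01T00:00:00Z` that appears as Creation Time in the describe output earlier in the thread. The arithmetic is consistent with that; the timestamp below is an illustrative value for mid-May 2025:

```shell
# Whole years between two Unix timestamps, using 365-day years.
# Usage: age_years <created_epoch> <now_epoch>
age_years() {
  echo $(( ($2 - $1) / 31536000 ))
}

# Epoch zero versus 2025-05-13 08:20 UTC; prints 55.
age_years 0 1747124400
```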

brainy-kilobyte-33711

08/05/2025, 7:47 AM

Hey - did you ever get to the bottom of this? We see this fairly consistently on initial backups of large disks or VMs with lots of disks.

powerful-easter-15334

08/05/2025, 7:51 AM

Hello! I wish I had a better solution for you, but I basically spread out the backup schedules, and increased the `retain` and `max failure` values. Initially they were the minimum of 3,2 and now on some VMs, I brought it to 6,5. I do not exactly understand why but I guess there may be something related to the slowness of the backup. In my case, the connection to S3 is slow.
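
The thread does not say how the schedules were spread out. One way to do it, sketched here purely as an illustration, is to generate daily cron expressions with a fixed gap between VMs so that slow backups overlap less. The function name, start hour, and gap are all hypothetical:

```shell
# Print one daily cron expression per VM, each offset by gap_min minutes.
# Usage: stagger_cron <start_hour> <gap_min> <count>
stagger_cron() {
  start_hour="$1"; gap_min="$2"; count="$3"
  i=0
  while [ "$i" -lt "$count" ]; do
    total=$(( start_hour * 60 + i * gap_min ))
    # minute, then hour (wrapped at midnight), then "every day"
    echo "$(( total % 60 )) $(( (total / 60) % 24 )) * * *"
    i=$(( i + 1 ))
  done
}

# Three VMs starting at 01:00, 45 minutes apart.
stagger_cron 1 45 3
```

Each printed line can be used as the schedule of one VM's backup job; the gap would need to be sized to how long a backup takes against the slow target.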

brainy-kilobyte-33711

08/05/2025, 7:56 AM

No problem, and we will likely do the same. The NFS we are using is slow and I guess we are also hitting timeouts, which would tie in with it working fine on subsequent backups, as those are incremental.

brainy-kilobyte-33711

08/05/2025, 8:03 AM

I see Webber is in the thread - do you know if the Longhorn setting Backup Concurrent Limit Per Backup will affect this? E.g. if we give it more threads, is it less likely to encounter timeouts?

magnificent-pencil-261

08/05/2025, 8:16 AM

Harvester does not modify this parameter. Based on the source code, it appears to enable more parallel handling of backups, though the actual effect depends on the Longhorn engine's implementation.

brainy-kilobyte-33711

08/05/2025, 8:39 AM

Have raised this at https://github.com/longhorn/longhorn/issues/11429. It seems like something is not updating the VolumeSnapshot CR after the timeout, once the Longhorn snapshot CR becomes readyToUse.

powerful-easter-15334

08/05/2025, 8:41 AM

https://github.com/harvester/harvester/issues/8252#issuecomment-2986848072 I had this issue which ended because I took a plane the next day and thus forgot to make a support bundle till you re-raised this issue here 😅

powerful-easter-15334

08/05/2025, 8:42 AM

What version of Harvester are you using?

brainy-kilobyte-33711

08/05/2025, 8:42 AM

lol, I have mentioned your ticket in a comment on mine; will let the Longhorn team decide if they want to close mine and reopen yours or the other way around

👍 1

brainy-kilobyte-33711

08/05/2025, 8:43 AM

We are on 1.4.2 currently

powerful-easter-15334

08/05/2025, 8:43 AM

1.4.0 here (still, and an upgrade is overdue). Will upgrade to 1.5.0 soon (fingers crossed) and I'll be interested to see if it persists.

brainy-kilobyte-33711

08/05/2025, 8:53 AM

We are holding off on 1.5.x until this is fixed: https://github.com/harvester/harvester/issues/8471. Afraid we would hit it and then be unable to roll back.
