# harvester
p
Hello! I noticed that my backups aren't going through anymore, with error:
VolumeSnapshot svmb-<...>-<server name>disk-1 in error state
I found this, which is closed with "bucket configuration": https://github.com/harvester/harvester/issues/4361
And then this, which is still open but doesn't have a fix: https://github.com/harvester/harvester/issues/5841
I'm running 1.4.0 (which I should update, but I don't want to update while my VMs are still having issues being backed up).
Yesterday I had that error on some 5 VMs. As suggested in the second issue, I restarted the csi-snapshotter deployment
kubectl -n longhorn-system rollout restart deployment csi-snapshotter
and it sort of worked. Overnight, 3 VMs got backed up properly, while 2 still have that error.
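A minimal sketch of the restart step that also waits for the rollout to finish before the next backup is attempted. The function name `restart_and_wait` and the 120s default are illustrative assumptions; `rollout restart` and `rollout status --timeout` are standard kubectl subcommands.

```shell
# Restart a deployment in longhorn-system and block until the new pods
# are rolled out. The 120s default is an arbitrary illustrative value.
restart_and_wait() {
  deploy="$1"
  wait_for="${2:-120s}"
  kubectl -n longhorn-system rollout restart deployment "$deploy" &&
    kubectl -n longhorn-system rollout status deployment "$deploy" --timeout="$wait_for"
}

# Usage: restart_and_wait csi-snapshotter
```

Waiting on `rollout status` only confirms that the new pods are up. It says nothing about whether the next snapshot will succeed.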
However, I can create a snapshot perfectly fine. These are all Windows VMs. Admittedly, I don't have any Linux VMs being backed up. I started one up, for testing purposes.
Backing up a Linux VM went through fine. Though that doesn't say much as some Windows VMs pass through and a couple don't -_-
I don't know if it's related but I'm scaling up the replicas on Longhorn and while one volume is able to rebuild the extra replica fine, another volume's rebuild is stuck on 0%
Generating a support bundle gets stuck at 16% and times out ._.
I stand corrected, it was just taking its time.
I don't know if I have two problems or if both are related. On the one hand, the VolumeSnapshot is in "error state" (all replicas are healthy). On the other hand, rebuilding replicas seems either very slow or not working at all.
And if the suggestion is to restart a longhorn deployment of sorts, I restarted every deployment in the longhorn-system namespace and I still get the same error. Maybe I should move on to the harvester-system ns
m
Which CRs are related to this? I mean the corresponding resource names for VMBackup, VolumeSnapshot, VolumeSnapshotContent, Longhorn Snapshot, and Longhorn Backup CRs.
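A rough sketch of how the chain could be followed from a VolumeSnapshot down to its VolumeSnapshotContent and snapshot handle. The function name `snapshot_chain` is hypothetical; the jsonpath fields are the standard `status.boundVolumeSnapshotContentName` and `status.snapshotHandle` from the Kubernetes snapshot API.

```shell
# Print the VolumeSnapshotContent bound to a VolumeSnapshot, and the
# snapshot handle that the CSI driver reported for that content.
snapshot_chain() {
  ns="$1"
  vs="$2"
  content="$(kubectl -n "$ns" get volumesnapshots.snapshot.storage.k8s.io "$vs" \
    -o jsonpath='{.status.boundVolumeSnapshotContentName}')"
  echo "VolumeSnapshotContent: $content"
  # VolumeSnapshotContent is cluster-scoped, so no namespace here.
  handle="$(kubectl get volumesnapshotcontents.snapshot.storage.k8s.io "$content" \
    -o jsonpath='{.status.snapshotHandle}')"
  echo "Snapshot handle: $handle"
}

# Usage: snapshot_chain <namespace> <volumesnapshot-name>
# The Longhorn side can then be listed with:
#   kubectl -n longhorn-system get snapshots.longhorn.io,backups.longhorn.io
```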
p
Oh, you gave me a good place to look.
k describe virtualmachinebackups.harvesterhci.io
on one of the affected backups I attempted manually. I found:
Volume Backups:
    Creation Time:    1970-01-01T00:00:00Z
    Csi Driver Name:  driver.longhorn.io
    Error:
      Message:             Failed to check and update snapshot content: failed to take snapshot of the volume pvc-5bd3aeca-91da-4426-8a31-d10268cb195e: "rpc error: code = Internal desc = waitForSnapshotToBeReady: timeout while waiting for snapshot snapshot-37ba210a-857b-4d39-8956-12b2cf127c7d to be ready"
However, in the Longhorn UI, it shows that the snapshot was successful
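To pull just the per-volume errors instead of reading the whole describe output, something like the following might work. The function name `vmbackup_errors` is hypothetical, and the field names `name` and `readyToUse` under `status.volumeBackups` are assumptions; only `error.message` is confirmed by the describe output above.

```shell
# Print one line per volume backup: name, ready flag, error message.
# Field names other than error.message are assumed, not verified.
vmbackup_errors() {
  ns="$1"
  backup="$2"
  kubectl -n "$ns" get virtualmachinebackups.harvesterhci.io "$backup" \
    -o jsonpath='{range .status.volumeBackups[*]}{.name}{"\t"}{.readyToUse}{"\t"}{.error.message}{"\n"}{end}'
}

# Usage: vmbackup_errors <namespace> <vmbackup-name>
```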
m
Maybe the LH snapshot takes longer than the CSI timeout because the related volume is busy, e.g. heavy I/O or rebuilding?
Can you check the related volumesnapshot?
p
That kinda makes sense, yeah. I am running Harvester on servers with hard disks, not SSD storage, so I would understand if I/O and access speeds were a limiting factor. Is there maybe a way to increase the timeout, to maybe a second?
m
The CSI timeout can't be changed at runtime, from my understanding; it's something internal to the LH implementation.
p
harvester-2:/home/rancher # k get volumesnapshots.snapshot.storage.k8s.io -n adbookings-servers | grep a449235d-a597-4970-b56c-cbbdf7070e07
svmb-<long string>-volume-adbookings-application-server-disk-2-i2bxf   false        adbookings-application-server-disk-2-i2bxf                                         longhorn            snapcontent-a449235d-a597-4970-b56c-cbbdf7070e07                  32h
So the volumesnapshot is not ready to use
Oh, I accidentally took it from a different VM than the one I got the virtualmachinebackups from. Either way, both have the same error.
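Rather than grepping for one snapshot content ID, a small filter over the table output can list every VolumeSnapshot that is not ready. This is a sketch that assumes the default column layout, where READYTOUSE is the second column; `not_ready_snapshots` is a made-up name.

```shell
# Reads `kubectl get volumesnapshots` table output on stdin and prints
# the NAME of every row whose READYTOUSE column (2nd field) is "false".
not_ready_snapshots() {
  awk '$2 == "false" { print $1 }'
}

# Usage:
#   kubectl get volumesnapshots.snapshot.storage.k8s.io -n adbookings-servers | not_ready_snapshots
```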
m
If you’ve confirmed that the Longhorn snapshot associated with the VolumeBackup has been successfully created, you can try performing a rollout restart of the CSI snapshotter.
p
I just recreated a backup to be sure because right now nothing is happening on the cluster; no rebuilding or anything which could cause a timeout. The snapshot for disk-2 hasn't appeared in the Longhorn UI. However, the progress keeps increasing despite the error. If I check in every so often, the percentage has gone up.
The snapshot for disk-1 does appear in the Longhorn UI and takes a little while, around 3 minutes, but eventually it does go through.
m
If you’ve confirmed that the Longhorn snapshot associated with the VolumeBackup has been successfully created, you can try performing a rollout restart of the CSI snapshotter.
Did you try this? And can you generate the latest SB?
p
The snapshot resource exists for disk-2 (the one in error state) but is not ready and doesn't even appear in the Longhorn UI. So I wouldn't call it "successfully created" but I'll restart the CSI snapshotter deployment and run a new backup, just in case
The SB is generating right now
Affected VM: adbookings-application-server Namespace: adbookings-servers Backup: test-backup-adbookings-app
supportbundle_afbe44ef-9329-41ce-85ea-3951ec79ce18_2025-05-13T07-54-59Z.zip
Restarted the csi snapshotter deployment - and now both disk-1 and disk-2 have the same issue.
Message:     Failed to create snapshot: failed to take snapshot of the volume pvc-2fdf0972-11ca-4564-8e05-3bee84c20dbf: "rpc error: code = DeadlineExceeded desc = waitForBackupControllerSync: timeout while waiting for backup controller to sync for volume pvc-2fdf0972-11ca-4564-8e05-3bee84c20dbf and snapshot snapshot-c3b1d58f-a779-4ac4-b3a9-78161fc8b0b0"
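The two failures seen so far name different wait functions (`waitForSnapshotToBeReady` earlier, `waitForBackupControllerSync` here), so it may help to tell them apart when checking many backups. A minimal sketch that only matches on the message text; the function name and the labels it prints are invented for illustration.

```shell
# Map an error message to a short label based on which wait timed out.
classify_backup_error() {
  case "$1" in
    *waitForSnapshotToBeReady*)    echo "snapshot-not-ready-timeout" ;;
    *waitForBackupControllerSync*) echo "backup-controller-sync-timeout" ;;
    *)                             echo "other" ;;
  esac
}

# Usage: classify_backup_error "$(some command that prints the Message field)"
```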
svmb-2386ca1f-87e9-4518-bbdb-8a8605bc7f42-20250508.2320-volume-adbookings-application-server-disk-2-i2bxf   true         adbookings-application-server-disk-2-i2bxf                           2Mi           longhorn            snapcontent-03c283d7-bf9c-4e9b-80b8-af7e587ee38d   55y            4d8h
Not sure how related, but in a backup I did (which worked) the age of the volume is shown as... 55 years? This also shows up in a failed volumesnapshot.
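One possible reading of the 55y, offered as a guess: if that column is the snapshot's creation time rendered as an age, then a creation time of `1970-01-01T00:00:00Z` (as in the VMBackup output earlier) would show up as roughly 55 years in May 2025. The arithmetic can be checked with a tiny helper; `years_since_epoch` is a made-up name.

```shell
# Whole years between the Unix epoch (1970-01-01T00:00:00Z) and a timestamp
# given in epoch seconds. 31556952 is the mean Gregorian year in seconds.
years_since_epoch() {
  echo $(( $1 / 31556952 ))
}

# 1747122899 is 2025-05-13T07:54:59Z, the timestamp in the support bundle name.
years_since_epoch 1747122899   # prints 55
```

If that reading is right, the 55y would just mean the creation time was never set (a zero value), which fits it appearing on a failed volumesnapshot too.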