# longhorn-storage
i
I manually imported the CRD through the `kubectl` command line
Can you elaborate on "manually imported ..."? What are your steps to upgrade?
c
I upgraded from version 1.6.0 to 1.6.3 through the Rancher App Market. The Helm logs indicated that the installation was successful, but when I checked several longhorn-system pods, there were errors showing that CRD resources were missing. So I found the corresponding CRD resources on GitHub and applied them manually: https://github.com/longhorn/longhorn/blob/v1.6.3/deploy/longhorn.yaml (I applied only the CRDs). Although the installation succeeded, none of my pods are running properly. When I checked the PVCs and the Longhorn management dashboard, they are all in a 'deleting' state. @icy-agency-38675
Will my data be lost?
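For reference, a minimal sketch of applying only the CRD documents from that manifest, assuming `yq` v4 and `kubectl` are available; the raw URL and the filtering step are assumptions, not necessarily the exact commands that were run here:

```bash
# Hypothetical: fetch the v1.6.3 manifest and apply only the
# CustomResourceDefinition documents from it.
curl -sL https://raw.githubusercontent.com/longhorn/longhorn/v1.6.3/deploy/longhorn.yaml \
  | yq eval 'select(.kind == "CustomResourceDefinition")' - \
  | kubectl apply -f -

# Check that the Longhorn CRDs are registered afterwards.
kubectl get crds | grep longhorn.io
```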
i
Sounds weird. Were both v1.6.0 and v1.6.3 installed through the Rancher App?
c
Yes
i
> there were errors showing that CRD resources were missing.

What are the errors?
c
Another observation: By default, Longhorn’s storage location is set to `/var/lib/longhorn`. However, in version 1.6.0, I manually set it to `/data`. After the upgrade, the storage location on all worker nodes reverted to `/var/lib/longhorn`, so I had to manually change each one back to `/data`. When Longhorn was upgraded from 1.6.0 to 1.6.3, it appears that the backend configuration files may have been lost as well; this is just my assumption, though.
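As a side note, if the setting in question is Longhorn's default data path (an assumption; the setting name below is a guess and was not confirmed in this thread), it can be inspected and compared against the per-node disk paths roughly like this:

```bash
# Assumed setting name: default-data-path (only affects newly added disks/nodes).
kubectl -n longhorn-system get settings.longhorn.io default-data-path -o yaml

# The disk paths actually in use are recorded on the Longhorn node CRs.
kubectl -n longhorn-system get nodes.longhorn.io -o yaml | grep -B1 'path:'
```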
i
Got it. Need to check if this is the culprit
c
We have a large number of PVCs in our cluster, exceeding 100TB. Restoring them one by one would take too long. Is there any other solution for a faster rollback? Even a downgrade would be acceptable.
i
> By default, Longhorn’s storage location is set to `/var/lib/longhorn`. However, in version 1.6.0, I manually set it to `/data`. After the upgrade, the storage location on all worker nodes reverted to `/var/lib/longhorn`, so I had to manually change each one back to `/data`.

What's the name of the setting you changed during the upgrade?
c
I forgot what the previous name was, but I’m certain the two are definitely different, because in the newly upgraded version the setting is left blank.
i
Can you provide:
• the overall steps of the upgrade you did and how you triggered it?
• a support bundle?
I'm currently a bit confused.
c
The first time I deployed Longhorn was about a year ago, using the Longhorn Rancher chart. At that time, I didn’t modify the default directory location, which was set to `/var/lib/longhorn`, but I manually changed each one to `/data` through the UI.

Yesterday, I decided to upgrade to 1.6.3 because I noticed some bugs and saw in the issues that these were fixed in later versions. I upgraded through the Rancher app without changing any values. Helm logs showed that the installation was successful. However, Longhorn wasn’t functioning properly, and when I checked the Longhorn backend, I found that many resources couldn’t be retrieved due to missing CRDs.

I tried upgrading through the Longhorn Rancher chart again, this time to version 1.7.2 (thinking that version 1.6.3 might not have included these CRD resources). At this point, the Rancher terminal log notified me that the upgrade failed because CRDs were missing. So I manually found these CRD resources on GitHub and applied them with `kubectl apply`. Version 1.6.3 is now working, but the data appears to be lost, as the previous pods are no longer functioning.
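A quick way to see whether the PVCs are really terminating and whether the underlying Longhorn CRs still exist (a sketch; assumes the default longhorn-system namespace):

```bash
kubectl get pvc --all-namespaces
kubectl -n longhorn-system get volumes.longhorn.io
kubectl -n longhorn-system get engines.longhorn.io,replicas.longhorn.io
```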
i
Can you provide a support bundle? Then, I can check the status of the CRs.
c
Generating the bundle is stuck at 83%.
```
2024-11-05T02:57:00.317687075Z time="2024-11-05T02:57:00Z" level=debug msg="Complete node web3-staking-tokyo-01"
2024-11-05T02:57:00.330030773Z time="2024-11-05T02:57:00Z" level=debug msg="Handle create node bundle for rke-worker-01"
2024-11-05T02:57:00.344685251Z time="2024-11-05T02:57:00Z" level=debug msg="Complete node rke-worker-01"
2024-11-05T02:57:00.520072770Z time="2024-11-05T02:57:00Z" level=debug msg="Handle create node bundle for rke-worker-06"
2024-11-05T02:57:00.528997170Z time="2024-11-05T02:57:00Z" level=debug msg="Complete node rke-worker-06"
2024-11-05T02:57:00.690127959Z time="2024-11-05T02:57:00Z" level=debug msg="Handle create node bundle for rke-worker-08"
2024-11-05T02:57:00.701616014Z time="2024-11-05T02:57:00Z" level=debug msg="Complete node rke-worker-08"
2024-11-05T02:57:00.777495986Z time="2024-11-05T02:57:00Z" level=debug msg="Handle create node bundle for rke-worker-07"
2024-11-05T02:57:00.793919777Z time="2024-11-05T02:57:00Z" level=debug msg="Complete node rke-worker-07"
2024-11-05T02:57:01.201829858Z time="2024-11-05T02:57:01Z" level=debug msg="Handle create node bundle for rke-worker-02"
2024-11-05T02:57:01.222568871Z time="2024-11-05T02:57:01Z" level=debug msg="Complete node rke-worker-02"
time="2024-11-05T02:57:01Z" level=debug msg="All nodes are completed"
2024-11-05T02:57:01.222657810Z time="2024-11-05T02:57:01Z" level=info msg="All node bundles are received."
time="2024-11-05T02:57:01Z" level=info msg="Succeed to run phase node bundle. Progress (50)."
2024-11-05T02:57:01.230345330Z time="2024-11-05T02:57:01Z" level=info msg="Running phase prometheus bundle"
2024-11-05T02:57:31.241536883Z time="2024-11-05T02:57:31Z" level=error msg="Failed to run phase prometheus bundle: failed to get prometheus alert: Get \"http://10.42.19.64:9090/api/v1/alerts\": dial tcp 10.42.19.64:9090: i/o timeout"
time="2024-11-05T02:57:31Z" level=error msg="Failed to run optionalPhases prometheus bundle: failed to get prometheus alert: Get \"http://10.42.19.64:9090/api/v1/alerts\": dial tcp 10.42.19.64:9090: i/o timeout"
2024-11-05T02:57:31.241592992Z time="2024-11-05T02:57:31Z" level=info msg="Running phase package"
2024-11-05T02:57:32.515724780Z time="2024-11-05T02:57:32Z" level=info msg="Succeed to run phase package. Progress (66)."
2024-11-05T02:57:32.515761629Z time="2024-11-05T02:57:32Z" level=info msg="Running phase done"
2024-11-05T02:57:32.515773419Z time="2024-11-05T02:57:32Z" level=info msg="Support bundle /tmp/support-bundle-kit/supportbundle_35290e10-63fd-4cca-a3e6-fe9c629b49a6_2024-11-05T02-56-05Z.zip ready to download"
2024-11-05T02:57:32.515780609Z time="2024-11-05T02:57:32Z" level=info msg="Succeed to run phase done. Progress (83)."
```
i
Weird. You can probably dump all the Longhorn CRs and the logs of the longhorn-manager and longhorn-instance-manager pods, zip them, and upload them here.
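One way to collect that, sketched out below; the loops and file layout are illustrative, not a prescribed procedure:

```bash
mkdir -p longhorn-dump && cd longhorn-dump

# Dump every Longhorn CR type registered in the cluster.
for crd in $(kubectl get crds -o name | grep longhorn.io); do
  name="${crd##*/}"                                  # e.g. volumes.longhorn.io
  kubectl -n longhorn-system get "$name" -o yaml > "${name}.yaml"
done

# Collect logs from the longhorn-manager and instance-manager pods.
for pod in $(kubectl -n longhorn-system get pods -o name \
             | grep -E 'longhorn-manager|instance-manager'); do
  kubectl -n longhorn-system logs "$pod" --all-containers > "${pod##*/}.log"
done

cd .. && zip -r longhorn-dump.zip longhorn-dump
```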
c
Oops, I messed up. I found that the replicas under `/data/replicas` have been cleaned.
i
I'm sorry to hear that...