# longhorn-storage
i
I manually imported the CRD through the `kubectl` command line
Can you elaborate on "manually imported ..."? What are your steps to upgrade?
c
I upgraded from version 1.6.0 to 1.6.3 through the Rancher App Market. The Helm logs indicated that the installation was successful, but when I checked several longhorn-system pods, there were errors showing that CRD resources were missing. So I found the corresponding CRD resources on GitHub and applied them manually: https://github.com/longhorn/longhorn/blob/v1.6.3/deploy/longhorn.yaml (I applied only the CRDs). Although the installation succeeded, none of my pods are running properly. When I checked the PVCs and the Longhorn management dashboard, they are all in a 'deleting' state. @icy-agency-38675
Will my data be lost?
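For reference, a minimal sketch of applying only the CRD documents from that manifest, assuming `yq` v4 and `kubectl` are available; the raw URL and the filtering step are assumptions, not necessarily the exact commands that were run here:

```bash
# Hypothetical: fetch the v1.6.3 manifest and apply only the
# CustomResourceDefinition documents from it.
curl -sL https://raw.githubusercontent.com/longhorn/longhorn/v1.6.3/deploy/longhorn.yaml \
  | yq eval 'select(.kind == "CustomResourceDefinition")' - \
  | kubectl apply -f -

# Check that the Longhorn CRDs are registered afterwards.
kubectl get crds | grep longhorn.io
```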
i
Sounds weird. Were both v1.6.0 and v1.6.3 installed through the Rancher App?
c
Yes
i
> there were errors showing that CRD resources were missing.

What are the errors?
c
Another observation: By default, Longhorn’s storage location is set to `/var/lib/longhorn`. However, in version 1.6.0, I manually set it to `/data`. After the upgrade, the storage location on all worker nodes reverted to `/var/lib/longhorn`, so I had to manually change each one back to `/data`. When Longhorn was upgraded from 1.6.0 to 1.6.3, it appears that the backend configuration files may have been lost as well; this is just my assumption, though.
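As a side note, if the setting in question is Longhorn's default data path (an assumption; the setting name below is a guess and was not confirmed in this thread), it can be inspected and compared against the per-node disk paths roughly like this:

```bash
# Assumed setting name: default-data-path (only affects newly added disks/nodes).
kubectl -n longhorn-system get settings.longhorn.io default-data-path -o yaml

# The disk paths actually in use are recorded on the Longhorn node CRs.
kubectl -n longhorn-system get nodes.longhorn.io -o yaml | grep -B1 'path:'
```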
i
Got it. Need to check if this is the culprit
c
We have a large number of PVCs in our cluster, exceeding 100TB. Restoring them one by one would take too long. Is there any other solution for a faster rollback? Even a downgrade would be acceptable.
i
> By default, Longhorn’s storage location is set to `/var/lib/longhorn`. However, in version 1.6.0, I manually set it to `/data`. After the upgrade, the storage location on all worker nodes reverted to `/var/lib/longhorn`, so I had to manually change each one back to `/data`.

What's the name of the setting you changed during the upgrade?
c
I forgot what the previous name was, but I’m certain the two are definitely different, because in the newly upgraded version the setting is left blank.
i
Can you provide:
• the overall steps of the upgrade you did and how you triggered it?
• a support bundle?
I'm currently a bit confused.
c
The first time I deployed Longhorn was about a year ago, using the Longhorn Rancher chart. At that time, I didn’t modify the default directory location, which was set to `/var/lib/longhorn`, but I manually changed each one to `/data` through the UI.

Yesterday, I decided to upgrade to 1.6.3 because I noticed some bugs and saw in the issues that these were fixed in later versions. I upgraded through the Rancher app without changing any values. Helm logs showed that the installation was successful. However, Longhorn wasn’t functioning properly, and when I checked the Longhorn backend, I found that many resources couldn’t be retrieved due to missing CRDs.

I tried upgrading through the Longhorn Rancher chart again, this time to version 1.7.2 (thinking that version 1.6.3 might not have included these CRD resources). At this point, the Rancher terminal log notified me that the upgrade failed because CRDs were missing. So I manually found these CRD resources on GitHub and applied them with `kubectl apply`. Version 1.6.3 is now working, but the data appears to be lost, as the previous pods are no longer functioning.
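A quick way to see whether the PVCs are really terminating and whether the underlying Longhorn CRs still exist (a sketch; assumes the default longhorn-system namespace):

```bash
kubectl get pvc --all-namespaces
kubectl -n longhorn-system get volumes.longhorn.io
kubectl -n longhorn-system get engines.longhorn.io,replicas.longhorn.io
```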
i
Can you provide a support bundle? Then, I can check the status of the CRs.
c
Generating the bundle is stuck at 83%.
```
2024-11-05T02:57:00.317687075Z time="2024-11-05T02:57:00Z" level=debug msg="Complete node web3-staking-tokyo-01"
2024-11-05T02:57:00.330030773Z time="2024-11-05T02:57:00Z" level=debug msg="Handle create node bundle for rke-worker-01"
2024-11-05T02:57:00.344685251Z time="2024-11-05T02:57:00Z" level=debug msg="Complete node rke-worker-01"
2024-11-05T02:57:00.520072770Z time="2024-11-05T02:57:00Z" level=debug msg="Handle create node bundle for rke-worker-06"
2024-11-05T02:57:00.528997170Z time="2024-11-05T02:57:00Z" level=debug msg="Complete node rke-worker-06"
2024-11-05T02:57:00.690127959Z time="2024-11-05T02:57:00Z" level=debug msg="Handle create node bundle for rke-worker-08"
2024-11-05T02:57:00.701616014Z time="2024-11-05T02:57:00Z" level=debug msg="Complete node rke-worker-08"
2024-11-05T02:57:00.777495986Z time="2024-11-05T02:57:00Z" level=debug msg="Handle create node bundle for rke-worker-07"
2024-11-05T02:57:00.793919777Z time="2024-11-05T02:57:00Z" level=debug msg="Complete node rke-worker-07"
2024-11-05T02:57:01.201829858Z time="2024-11-05T02:57:01Z" level=debug msg="Handle create node bundle for rke-worker-02"
2024-11-05T02:57:01.222568871Z time="2024-11-05T02:57:01Z" level=debug msg="Complete node rke-worker-02"
time="2024-11-05T02:57:01Z" level=debug msg="All nodes are completed"
2024-11-05T02:57:01.222657810Z time="2024-11-05T02:57:01Z" level=info msg="All node bundles are received."
time="2024-11-05T02:57:01Z" level=info msg="Succeed to run phase node bundle. Progress (50)."
2024-11-05T02:57:01.230345330Z time="2024-11-05T02:57:01Z" level=info msg="Running phase prometheus bundle"
2024-11-05T02:57:31.241536883Z time="2024-11-05T02:57:31Z" level=error msg="Failed to run phase prometheus bundle: failed to get prometheus alert: Get \"http://10.42.19.64:9090/api/v1/alerts\": dial tcp 10.42.19.64:9090: i/o timeout"
time="2024-11-05T02:57:31Z" level=error msg="Failed to run optionalPhases prometheus bundle: failed to get prometheus alert: Get \"http://10.42.19.64:9090/api/v1/alerts\": dial tcp 10.42.19.64:9090: i/o timeout"
2024-11-05T02:57:31.241592992Z time="2024-11-05T02:57:31Z" level=info msg="Running phase package"
2024-11-05T02:57:32.515724780Z time="2024-11-05T02:57:32Z" level=info msg="Succeed to run phase package. Progress (66)."
2024-11-05T02:57:32.515761629Z time="2024-11-05T02:57:32Z" level=info msg="Running phase done"
2024-11-05T02:57:32.515773419Z time="2024-11-05T02:57:32Z" level=info msg="Support bundle /tmp/support-bundle-kit/supportbundle_35290e10-63fd-4cca-a3e6-fe9c629b49a6_2024-11-05T02-56-05Z.zip ready to download"
2024-11-05T02:57:32.515780609Z time="2024-11-05T02:57:32Z" level=info msg="Succeed to run phase done. Progress (83)."
```
i
Weird. You can probably dump all the Longhorn CRs and the logs of the longhorn-manager and longhorn-instance-manager pods, zip them, and upload them here.
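One way to collect that, sketched out below; the loops and file layout are illustrative, not a prescribed procedure:

```bash
mkdir -p longhorn-dump && cd longhorn-dump

# Dump every Longhorn CR type registered in the cluster.
for crd in $(kubectl get crds -o name | grep longhorn.io); do
  name="${crd##*/}"                                  # e.g. volumes.longhorn.io
  kubectl -n longhorn-system get "$name" -o yaml > "${name}.yaml"
done

# Collect logs from the longhorn-manager and instance-manager pods.
for pod in $(kubectl -n longhorn-system get pods -o name \
             | grep -E 'longhorn-manager|instance-manager'); do
  kubectl -n longhorn-system logs "$pod" --all-containers > "${pod##*/}.log"
done

cd .. && zip -r longhorn-dump.zip longhorn-dump
```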
c
Oops, I messed up. I found that the replicas under `/data/replicas` have been cleaned.
i
I'm sorry to hear that...