# longhorn-storage
f
Usage space on node 001 is too high
Can you check if any other app is eating storage on node 001?
s
Node 001 is only running Harvester v1.0.3 - no other software is running on it. The space is mostly taken up by the Longhorn replicas folder.
rancher@harvester001:~> sudo du -sh /var
1.4T	/var

rancher@harvester001:~> sudo du -sh /var/lib/longhorn/replicas/
1.2T	/var/lib/longhorn/replicas/
Of the files in `/var/lib/longhorn/replicas/` on node 001, about 100GB start with `backup-of-` and 1.1TB start with `pvc-`. On the other two nodes, `/var/lib/longhorn/replicas/` has between 400GB and 470GB of files which start with `pvc-`.
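(For reference, a quick way to tally the two prefixes - a rough sketch, run on the node itself; `-c` adds a grand-total line:)
sudo du -sch /var/lib/longhorn/replicas/backup-of-* | tail -n 1
sudo du -sch /var/lib/longhorn/replicas/pvc-* | tail -n 1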
Is it possible that there are orphan files in that folder on node 001 which Longhorn has somehow left and forgotten about?
f
Is it possible that there are orphan files in that folder on node 001 which Longhorn has somehow left and forgotten about?
It is possible. Which Longhorn version is it?
If it is Longhorn 1.3.x, can you check the orphan replicas in `Setting -> Orphaned Data`?
s
The current version of Harvester comes with v1.2.4 of Longhorn.
f
In this case, we would have to check the replicas folder manually to see which ones are orphaned (i.e. not corresponding to any replica listed by `kubectl get replicas -n longhorn-system`).
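For example, something along these lines on node 001 (just a rough starting point for the comparison):
kubectl get replicas -n longhorn-system
ls -1 /var/lib/longhorn/replicas/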
s
Currently on the underground on the way into the office... I assume you mean comparing the output of that kubectl command against the files in the directory. I can do that between meetings and development work in the office.
From work it's too difficult to attach files which are the output of `kubectl` etc, but... I have run that `kubectl` command, grepped/awked out the first column of the replicas on node 001, and compared that with the list of files on the disk. The two do not match because the format is slightly different, but they also do not match because the suffix is not the same. For instance, the replica from `kubectl`:
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-r-f09879a8
and the similarly named files on node 001:
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-2b05ad20
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-407c9cfc
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-600d4f7b
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-620b1017
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-6281e0ba
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-6a5d396e
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-7656b4c9
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-7cba3f42
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-804f169c
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-80ff60c9
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-82ffb4c3
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-8602faf7
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-8d4b495b
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-9c1bfb80
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-a4aee71a
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-adb8e903
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-ce33ffa6
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-d5214e74
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-df773b10
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-eaa947af
pvc-338f5368-e257-4a27-ba60-ace45f0e4501-e13d5a01
There is no `f09879a8` suffix in the file list.
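(For the record, the comparison was roughly this - a sketch; the grep assumes the node name shows up in the default `kubectl` output:)
kubectl get replicas -n longhorn-system | grep harvester001 | awk '{print $1}' | sort > replica-names.txt
ssh rancher@harvester001 'ls -1 /var/lib/longhorn/replicas/' | sort > files-on-disk.txt
diff replica-names.txt files-on-disk.txt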
f
Oh, sorry. You should:
1. Run `kubectl -n longhorn-system get replicas.longhorn.io -o jsonpath='{.items[*].spec.dataDirectoryName}'`
2. Check the `/var/lib/longhorn/replicas` folder on node 001 to see which directories inside it don't appear in the above list (see the sketch below).
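For example, something like this should print the on-disk directories that Longhorn does not know about (a rough sketch - run the `ls` part on node 001 and double-check the output before deleting anything):
kubectl -n longhorn-system get replicas.longhorn.io -o jsonpath='{.items[*].spec.dataDirectoryName}' | tr ' ' '\n' | sort > expected.txt
ls -1 /var/lib/longhorn/replicas/ | sort > on-disk.txt
# lines only in on-disk.txt, i.e. directories no replica object points at
comm -13 expected.txt on-disk.txt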
s
There are no directories listed in the above command - only files.
for l in $(kubectl -n longhorn-system get replicas.longhorn.io -o jsonpath='{.items[*].spec.dataDirectoryName}' --context=harvester-cluster); do echo $l; done > harvester-cluster-kubectl-replicas-dataDirectoryName.txt
`/var/lib/longhorn/replicas/` on node 001 is a flat directory containing no directories, only files.
Honestly, I'm wondering if I should shut down all of k8s on node 001, delete everything in `/var/lib/longhorn/replicas/`, and start it all up again. Since Harvester released v1.1.0 yesterday, I just need to get the current system kind-of working so that an upgrade can work.
f
`/var/lib/longhorn/replicas/` on node 001 is a flat directory containing no directories, only files.
I think this is not true. From the output you can see a bunch of directories. For example, for `pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a` there are a bunch of directories:
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-3d81c994
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-44d01e56
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-4cfbc1db
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-4d9474ec
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-52b328ed
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-5a0fdd2a
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-5fbfd8d9
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-686f7f6d
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-7630c133
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-80d3e184
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-823113c5
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-847ab52b
You can check these directories against this active list: https://rancher-users.slack.com/files/U96EY1QHZ/F048ARFS4SX/harvester-cluster-kubectl-replicas-datadirectoryname.txt. Then remove the directories that are not in the active list (a sketch follows below).
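For example, on node 001 with the active list saved locally (a rough sketch - the path to the saved list is a placeholder, and it only echoes candidates so you can review them before removing anything):
cd /var/lib/longhorn/replicas
for d in pvc-*/; do
  d=${d%/}
  grep -qxF "$d" /path/to/active-list.txt || echo "orphan: $d"
done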
s
I think this is not true. From the output you can see a bunch of directories.
Oops - that's embarrassing - I'd convinced myself they were files and not directories. <hangs head in shame> 😞 Yes - now that I look properly and do a diff between the output of that `kubectl` command and the directory list, I can see there are a few (a very small number) that are in both. When I'm fully awake tomorrow I'll do the `ls` and the `kubectl` at the same instant and remove any folders which exist on disk and are not in the `kubectl` output - checking that their modification timestamp is some time in the past. Thanks!!!
Just because I'm paranoid, I'm doing a lot of investigation before I delete. I have some Python on my home development desktop which uses `kubectl` to get the list of expected directories, uses `ssh` to get the directories on disk from the node, and checks whether each directory on disk is expected from the `kubectl` output:
total of 731 directories found on disk
total of 92 directories expected from kubectl
total of 704 directories are not expected
None of them have changed since Oct 24th, which was the last time that particular node was restarted, so I feel I can add `rm -rf /var/lib/longhorn/replicas/%s` to the script.
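The same logic in plain shell looks roughly like this (a sketch of the idea, not the actual Python):
kubectl -n longhorn-system get replicas.longhorn.io -o jsonpath='{.items[*].spec.dataDirectoryName}' | tr ' ' '\n' | sort -u > expected.txt
ssh rancher@harvester001 'ls -1 /var/lib/longhorn/replicas/' | sort > on-disk.txt
comm -13 expected.txt on-disk.txt > unexpected.txt
wc -l on-disk.txt expected.txt unexpected.txt
# after reviewing, each line of unexpected.txt becomes:
#   ssh rancher@harvester001 'sudo rm -rf /var/lib/longhorn/replicas/<name>'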
f
total of 704 directories are not expected
This matches the huge amount of used space observed.
s
This matches the huge amount of used space observed.
Yes. Removing the folders, with their contents, freed around 1TB of data on the 2TB disk - and some of the degraded volumes rebuilt. Then I started upgrading Harvester to v1.1.0, bringing Longhorn up to v1.3.2 - where I can see the `Setting -> Orphaned Data` option and could prune the other orphaned data from the other nodes. There are still some failed volumes. I don't understand why this one cannot be rebuilt from the one healthy copy - it has been saying 27% restored for days.
And while I'm here being annoying - do you have any clue why this volume just keeps cycling through attaching and detaching (see the animated GIF)? It's stopping one VM in Harvester from starting, it's stopping a backup of that VM, and it's stopping a conversion from a volume to an image in Harvester. The logs from the instance manager on the node - which flash up for a short time and then disappear again - say this:
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=info msg="Process Manager: prepare to create process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=debug msg="Process Manager: validate process path: /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.3.2/longhorn dir: /host/var/lib/longhorn/engine-binaries/ image: longhornio-longhorn-engine-v1.3.2 binary: longhorn"
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=info msg="Process Manager: created process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Creating volume /host/var/lib/longhorn/replicas/pvc-af69ea5b-9798-443a-8527-583c5fd35b70-4280b053, size 10737418240/512"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Starting to create disk" disk=000
time="2022-10-29T15:44:41Z" level=info msg="Finished creating disk" disk=000
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Listening on sync agent server 0.0.0.0:10182"
time="2022-10-29T15:44:41Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10180"
time="2022-10-29T15:44:41Z" level=info msg="Listening on data server 0.0.0.0:10181"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Listening on sync 0.0.0.0:10182"
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=info msg="Process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 has started at localhost:10180"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: prepare to delete process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: deleted process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: trying to stop process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="wait for process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 to shutdown"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: wait for process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 to shutdown before unregistering process"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:46Z" level=warning msg="Received signal interrupt to shutdown"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:46Z" level=warning msg="Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="Process Manager: process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 stopped"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: prepare to delete process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: deleted process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="Process Manager: successfully unregistered process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="Process Manager: successfully unregistered process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: prepare to delete process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
f
We would need a full support bundle to understand why the volume is stuck in the attach/detach loop. You can generate one using the link at the bottom of the Longhorn UI.
s
Sent in this thread.
f
There are a lot of rebuilding errors like this one:
2022-11-01T07:45:48.921548657Z time="2022-11-01T07:45:48Z" level=warning msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/pvc-af69ea5b-9798-443a-8527-583c5fd35b70-e-bdb26b32 error="failed to sync engine for longhorn-system/pvc-af69ea5b-9798-443a-8527-583c5fd35b70-e-bdb26b32: failed to start rebuild for pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-2bb6d72d of pvc-af69ea5b-9798-443a-8527-583c5fd35b70-e-bdb26b32: proxyServer=10.52.1.173:8501 destination=10.52.1.173:10001: failed to list replicas for volume: rpc error: code = Unknown desc = failed to list replicas for volume 10.52.1.173:10001: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.52.1.173:10001: connect: connection refused\"" node=harvester002
It seems that the CNI network might have a problem?
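A couple of quick checks that might narrow it down (a rough sketch - the grep targets are taken from the error message above):
# which pod currently owns 10.52.1.173, and on which node is it running?
kubectl -n longhorn-system get pods -o wide | grep 10.52.1.173
# are this volume's replica objects present, and what state are they in?
kubectl -n longhorn-system get replicas.longhorn.io | grep pvc-af69ea5b-9798-443a-8527-583c5fd35b70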
s
Well - there are plenty of other volumes which do work. The Harvester cluster is up. All three nodes are healthy and can communicate. Other VMs are working okay. I might just stop wasting your time, delete the Harvester VM which this volume is connected to, and rebuild it. I hoped that by preserving the VMs and volumes in this state I might be of some assistance to the project :-/