adamant-kite-43734
10/24/2022, 11:37 AM

famous-journalist-11332
10/25/2022, 10:48 PM

famous-journalist-11332
10/25/2022, 10:49 PM

sticky-summer-13450
10/26/2022, 6:29 AM
rancher@harvester001:~> sudo du -sh /var
1.4T /var
rancher@harvester001:~> sudo du -sh /var/lib/longhorn/replicas/
1.2T /var/lib/longhorn/replicas/
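To see how that 1.2T breaks down inside the replicas folder, usage can be totalled per filename prefix. A minimal sketch (the `usage_by_prefix` helper name is made up; the path and prefixes are the ones discussed in this thread):

```shell
# Sketch: report combined disk usage for entries matching a prefix.
# usage_by_prefix DIR PREFIX -> prints the combined size, e.g. "1.1T"
usage_by_prefix() {
  dir=$1; prefix=$2
  # du -sc sums each matching entry and appends a "total" line;
  # the awk END block keeps only the size field of that last line
  du -sch "$dir/$prefix"* 2>/dev/null | awk 'END { print $1 }'
}

# Example (paths assumed from the thread, run on the node):
# usage_by_prefix /var/lib/longhorn/replicas backup-of-
# usage_by_prefix /var/lib/longhorn/replicas pvc-
```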
Of the files in /var/lib/longhorn/replicas/ on node 001, about 100GB start with backup-of- and 1.1TB start with pvc-.
On the other two nodes, /var/lib/longhorn/replicas/ has between 400GB and 470GB of files which start with pvc-.
Is it possible that there are orphan files in that folder on node 001 which Longhorn has somehow left and forgotten about?

famous-journalist-11332
10/26/2022, 6:32 AM
> Is it possible that there are orphan files in that folder on node 001 which Longhorn has somehow left and forgotten about?
It is possible. Which Longhorn version is it?
famous-journalist-11332
10/26/2022, 6:35 AM
setting -> orphaned data?

sticky-summer-13450
10/26/2022, 6:49 AM

famous-journalist-11332
10/26/2022, 7:06 AM

sticky-summer-13450
10/26/2022, 7:19 AM
10/26/2022, 2:27 PM
`kubectl` etc, but...
I have done that `kubectl` command, and grepped/awked out the first column of the replicas on node 001, and compared that with the list of files on the disk. The two do not match because the format is slightly different, but they also do not match because the suffix number is not the same.
For instance, the replica from `kubectl`:
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-r-f09879a8
and the similarly named files on node 001:
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-2b05ad20
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-407c9cfc
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-600d4f7b
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-620b1017
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-6281e0ba
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-6a5d396e
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-7656b4c9
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-7cba3f42
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-804f169c
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-80ff60c9
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-82ffb4c3
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-8602faf7
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-8d4b495b
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-9c1bfb80
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-a4aee71a
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-adb8e903
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-ce33ffa6
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-d5214e74
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-df773b10
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-eaa947af
pvc-338f5368-e257-4a27-ba60-ace45f0e4501-e13d5a01
There is no f09879a8 suffix in the file list.

famous-journalist-11332
10/27/2022, 12:39 AM
1. kubectl -n longhorn-system get replicas.longhorn.io -o jsonpath='{.items[*].spec.dataDirectoryName}'
2. Check the /var/lib/longhorn/replicas folder on node001 to see which directory inside it doesn’t appear on the above list.

sticky-summer-13450
10/27/2022, 7:29 AM
for l in $(kubectl -n longhorn-system get replicas.longhorn.io -o jsonpath='{.items[*].spec.dataDirectoryName}' --context=harvester-cluster); do echo $l; done > harvester-cluster-kubectl-replicas-dataDirectoryName.txt
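With the expected names saved to a file like that, the on-disk directory listing can be compared against it with `comm`. A minimal sketch (the `list_orphan_dirs` helper name and the temp-file paths are made up, not from the thread):

```shell
# Sketch: print replica directories that exist on disk but are absent
# from the expected list produced by the kubectl dataDirectoryName query.
list_orphan_dirs() {
  replica_dir=$1   # e.g. /var/lib/longhorn/replicas
  expected=$2      # file with one expected directory name per line
  ls -1 "$replica_dir" | sort > /tmp/on-disk.$$
  sort "$expected" > /tmp/expected.$$
  # comm -23 keeps lines unique to the first (on-disk) listing
  comm -23 /tmp/on-disk.$$ /tmp/expected.$$
  rm -f /tmp/on-disk.$$ /tmp/expected.$$
}
```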
sticky-summer-13450
10/27/2022, 7:30 AM
/var/lib/longhorn/replicas/ on node 001 is a flat directory containing no directories, only files.

sticky-summer-13450
10/27/2022, 7:55 AM
/var/lib/longhorn/replicas/, and start it all up again.

sticky-summer-13450
10/27/2022, 7:56 AM

famous-journalist-11332
10/27/2022, 10:36 PM
> /var/lib/longhorn/replicas/ on node 001 is a flat directory containing no directories, only files.
I think this is not true. From the output you can see a bunch of directories. For example, for pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a there are a bunch of directories:
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-3d81c994
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-44d01e56
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-4cfbc1db
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-4d9474ec
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-52b328ed
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-5a0fdd2a
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-5fbfd8d9
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-686f7f6d
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-7630c133
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-80d3e184
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-823113c5
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-847ab52b
You can check these directories against this active list https://rancher-users.slack.com/files/U96EY1QHZ/F048ARFS4SX/harvester-cluster-kubectl-replicas-datadirectoryname.txt. Then remove the directories that are not in the active list.

sticky-summer-13450
10/27/2022, 10:56 PM
> I think this is not true. From the output you can see a bunch of directories.
Oops - that's embarrassing - I'd convinced myself they were files and not directories. <hangs head in shame> 😞 Yes - now that I look properly and do a diff between the output of that `kubectl` command and the directory list, I can see there are only a few (a very small number) that are in both.
When I'm fully awake tomorrow I'll do the ls and the `kubectl` at the same instant, and remove any folders which exist on disk and are not in the `kubectl` output - checking that their modification timestamp is some time in the past.
Thanks!!!

sticky-summer-13450
10/28/2022, 3:06 PM
I've written a script which uses `kubectl` to get the list of expected directories, uses `ssh` to get the directories on disk from the node, and checks whether each directory on disk is expected from the `kubectl` output.
total of 731 directories found on disk
total of 92 directories expected from kubectl
total of 704 directories are not expected
None of them have changed since Oct 24th, which was the last time that particular node was restarted, so I feel I can add `rm -rf /var/lib/longhorn/replicas/%s` to the script.

famous-journalist-11332
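A rough local sketch of that cleanup step (this is a guess at the script's shape, not the actual script from the thread; `prune_orphans` is a made-up helper, it runs on the node rather than over `ssh`, and `find -mtime +1` stands in for the "unchanged since the last restart" check):

```shell
# Sketch: remove orphan replica directories, but only those whose
# modification time is safely in the past (here: older than one day).
prune_orphans() {
  replica_dir=$1   # e.g. /var/lib/longhorn/replicas
  expected=$2      # file of directory names kubectl still knows about
  ls -1 "$replica_dir" | sort > /tmp/prune-disk.$$
  sort "$expected" > /tmp/prune-exp.$$
  # comm -23: directories on disk that are not in the expected list
  comm -23 /tmp/prune-disk.$$ /tmp/prune-exp.$$ | while read -r d; do
    # -maxdepth 0 -mtime +1: match only the directory itself, and only
    # if its mtime is more than 24 hours old; then remove it
    find "$replica_dir/$d" -maxdepth 0 -mtime +1 -exec rm -rf {} +
  done
  rm -f /tmp/prune-disk.$$ /tmp/prune-exp.$$
}
```

The age guard means a replica directory that Longhorn is actively writing (or has just created for a rebuild) is never a deletion candidate, which is the same safety property the author checks by hand.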
10/28/2022, 10:49 PM
> total of 704 directories are not expected
This matched the huge observed amount of used space.

sticky-summer-13450
10/29/2022, 7:20 AM
> This matched the huge observed amount of used space
Yes. Removing the folders, with their contents, removed around 1TB of data from the 2TB disk - and some of the degraded volumes rebuilt. Then I started upgrading Harvester to v1.1.0, bringing Longhorn up to v1.3.2 - where I can see the Setting -> Orphaned data option, where I could prune the other orphaned data from the other nodes.
There are still some failed volumes. I don't understand why this one cannot be rebuilt from the one healthy copy.

sticky-summer-13450
10/29/2022, 7:51 AM

sticky-summer-13450
10/29/2022, 3:48 PM
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=info msg="Process Manager: prepare to create process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=debug msg="Process Manager: validate process path: /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.3.2/longhorn dir: /host/var/lib/longhorn/engine-binaries/ image: longhornio-longhorn-engine-v1.3.2 binary: longhorn"
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=info msg="Process Manager: created process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Creating volume /host/var/lib/longhorn/replicas/pvc-af69ea5b-9798-443a-8527-583c5fd35b70-4280b053, size 10737418240/512"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Starting to create disk" disk=000
time="2022-10-29T15:44:41Z" level=info msg="Finished creating disk" disk=000
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Listening on sync agent server 0.0.0.0:10182"
time="2022-10-29T15:44:41Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10180"
time="2022-10-29T15:44:41Z" level=info msg="Listening on data server 0.0.0.0:10181"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Listening on sync 0.0.0.0:10182"
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=info msg="Process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 has started at localhost:10180"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: prepare to delete process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: deleted process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: trying to stop process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="wait for process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 to shutdown"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: wait for process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 to shutdown before unregistering process"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:46Z" level=warning msg="Received signal interrupt to shutdown"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:46Z" level=warning msg="Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="Process Manager: process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 stopped"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: prepare to delete process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: deleted process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="Process Manager: successfully unregistered process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="Process Manager: successfully unregistered process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: prepare to delete process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
famous-journalist-11332
10/31/2022, 10:11 PM

sticky-summer-13450
11/01/2022, 7:47 AM

famous-journalist-11332
11/04/2022, 12:02 AM
2022-11-01T07:45:48.921548657Z time="2022-11-01T07:45:48Z" level=warning msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/pvc-af69ea5b-9798-443a-8527-583c5fd35b70-e-bdb26b32 error="failed to sync engine for longhorn-system/pvc-af69ea5b-9798-443a-8527-583c5fd35b70-e-bdb26b32: failed to start rebuild for pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-2bb6d72d of pvc-af69ea5b-9798-443a-8527-583c5fd35b70-e-bdb26b32: proxyServer=10.52.1.173:8501 destination=10.52.1.173:10001: failed to list replicas for volume: rpc error: code = Unknown desc = failed to list replicas for volume 10.52.1.173:10001: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.52.1.173:10001: connect: connection refused\"" node=harvester002
It seems that the CNI network has some problem?

sticky-summer-13450
11/04/2022, 9:20 AM