# longhorn-storage
f
Usage space on node 001 is too high
Can you check if any other app is eating storage on node 001?
s
Node 001 is only running Harvester v1.0.3 - no other software is running on it. The space is mostly taken up by the Longhorn replicas folder.
rancher@harvester001:~> sudo du -sh /var
1.4T	/var

rancher@harvester001:~> sudo du -sh /var/lib/longhorn/replicas/
1.2T	/var/lib/longhorn/replicas/
Of the files in `/var/lib/longhorn/replicas/` on node 001, about 100GB start with `backup-of-` and 1.1TB start with `pvc-`. On the other two nodes, `/var/lib/longhorn/replicas/` has between 400GB and 470GB of files which start with `pvc-`.
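(For reference, a quick way to tally the two prefixes - a rough sketch, run on the node itself; `-c` adds a grand-total line:)
sudo du -sch /var/lib/longhorn/replicas/backup-of-* | tail -n 1
sudo du -sch /var/lib/longhorn/replicas/pvc-* | tail -n 1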
Is it possible that there are orphan files in that folder on node 001 which Longhorn has somehow left and forgotten about?
f
Is it possible that there are orphan files in that folder on node 001 which Longhorn has somehow left and forgotten about?
It is possible. Which Longhorn version is it?
If it is Longhorn 1.3.x, can you check the orphan replicas in `Setting -> Orphaned Data`?
s
The current version of Harvester comes with v1.2.4 of Longhorn.
f
In this case, we would have to check the replicas folder manually to see which ones are orphaned (i.e. not corresponding to any replica listed by `kubectl get replicas -n longhorn-system`).
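For example, something along these lines on node 001 (just a rough starting point for the comparison):
kubectl get replicas -n longhorn-system
ls -1 /var/lib/longhorn/replicas/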
s
Currently on the underground on the way into the office... I assume you mean comparing the output of that kubectl command against the files in the directory. I can do that between meetings and development work in the office.
From work it's too difficult to attach files which are the output of `kubectl` etc, but... I have run that `kubectl` command, grepped/awked out the first column of the replicas on node 001, and compared that with the list of files on the disk. The two do not match because the format is slightly different, but they also do not match because the suffix is not the same. For instance, the replica from `kubectl`:
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-r-f09879a8
and the similarly named files on node 001:
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-2b05ad20
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-407c9cfc
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-600d4f7b
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-620b1017
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-6281e0ba
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-6a5d396e
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-7656b4c9
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-7cba3f42
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-804f169c
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-80ff60c9
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-82ffb4c3
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-8602faf7
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-8d4b495b
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-9c1bfb80
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-a4aee71a
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-adb8e903
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-ce33ffa6
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-d5214e74
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-df773b10
pvc-2909b66d-9e7e-4b12-98f8-c94ccfc08357-eaa947af
pvc-338f5368-e257-4a27-ba60-ace45f0e4501-e13d5a01
There is no `f09879a8` suffix in the file list.
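(For the record, the comparison was roughly this - a sketch; the grep assumes the node name shows up in the default `kubectl` output:)
kubectl get replicas -n longhorn-system | grep harvester001 | awk '{print $1}' | sort > replica-names.txt
ssh rancher@harvester001 'ls -1 /var/lib/longhorn/replicas/' | sort > files-on-disk.txt
diff replica-names.txt files-on-disk.txt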
f
Oh, sorry. You should:
1. Run `kubectl -n longhorn-system get replicas.longhorn.io -o jsonpath='{.items[*].spec.dataDirectoryName}'`
2. Check the `/var/lib/longhorn/replicas` folder on node 001 to see which directories inside it don't appear in the above list (see the sketch below).
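For example, something like this should print the on-disk directories that Longhorn does not know about (a rough sketch - run the `ls` part on node 001 and double-check the output before deleting anything):
kubectl -n longhorn-system get replicas.longhorn.io -o jsonpath='{.items[*].spec.dataDirectoryName}' | tr ' ' '\n' | sort > expected.txt
ls -1 /var/lib/longhorn/replicas/ | sort > on-disk.txt
# lines only in on-disk.txt, i.e. directories no replica object points at
comm -13 expected.txt on-disk.txt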
s
There are no directories listed in the above command - only files.
for l in $(kubectl -n longhorn-system get replicas.longhorn.io -o jsonpath='{.items[*].spec.dataDirectoryName}' --context=harvester-cluster); do echo $l; done > harvester-cluster-kubectl-replicas-dataDirectoryName.txt
`/var/lib/longhorn/replicas/` on node 001 is a flat directory containing no directories, only files.
Honestly, I'm wondering if I should shut down all of k8s on node 001, delete everything in `/var/lib/longhorn/replicas/`, and start it all up again. Since Harvester released v1.1.0 yesterday, I just need to get the current system kind-of working so that an upgrade can work.
f
`/var/lib/longhorn/replicas/` on node 001 is a flat directory containing no directories, only files.
I think this is not true. From the output you can see a bunch of directories. For example, for `pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a` there are a bunch of directories:
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-3d81c994
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-44d01e56
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-4cfbc1db
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-4d9474ec
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-52b328ed
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-5a0fdd2a
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-5fbfd8d9
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-686f7f6d
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-7630c133
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-80d3e184
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-823113c5
pvc-5557f46b-8410-40d1-b2e9-ce034f5dcb3a-847ab52b
You can check these directories against this active list: https://rancher-users.slack.com/files/U96EY1QHZ/F048ARFS4SX/harvester-cluster-kubectl-replicas-datadirectoryname.txt. Then remove the directories that are not in the active list (a sketch follows below).
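For example, on node 001 with the active list saved locally (a rough sketch - the path to the saved list is a placeholder, and it only echoes candidates so you can review them before removing anything):
cd /var/lib/longhorn/replicas
for d in pvc-*/; do
  d=${d%/}
  grep -qxF "$d" /path/to/active-list.txt || echo "orphan: $d"
done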
s
I think this is not true. From the output you can see a bunch of directories.
Oops - that's embarrassing - I'd convinced myself they were files and not directories. <hangs head in shame> 😞 Yes - now that I look properly and do a diff between the output of that `kubectl` command and the directory list, I can see there are a few (a very small number) that are in both. When I'm fully awake tomorrow I'll do the `ls` and the `kubectl` at the same instant and remove any folders which exist on disk and are not in the `kubectl` output - checking that their modification timestamp is some time in the past. Thanks!!!
Just because I'm paranoid, I'm doing a lot of investigation before I delete. I have some Python on my home development desktop which uses `kubectl` to get the list of expected directories, uses `ssh` to get the directories on disk from the node, and checks whether each directory on disk is expected from the `kubectl` output:
total of 731 directories found on disk
total of 92 directories expected from kubectl
total of 704 directories are not expected
None of them have changed since Oct 24th, which was the last time that particular node was restarted, so I feel I can add `rm -rf /var/lib/longhorn/replicas/%s` to the script.
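The same logic in plain shell looks roughly like this (a sketch of the idea, not the actual Python):
kubectl -n longhorn-system get replicas.longhorn.io -o jsonpath='{.items[*].spec.dataDirectoryName}' | tr ' ' '\n' | sort -u > expected.txt
ssh rancher@harvester001 'ls -1 /var/lib/longhorn/replicas/' | sort > on-disk.txt
comm -13 expected.txt on-disk.txt > unexpected.txt
wc -l on-disk.txt expected.txt unexpected.txt
# after reviewing, each line of unexpected.txt becomes:
#   ssh rancher@harvester001 'sudo rm -rf /var/lib/longhorn/replicas/<name>'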
f
total of 704 directories are not expected
This matches the huge amount of used space observed.
s
This matches the huge amount of used space observed.
Yes. Removing the folders, with their contents, freed around 1TB of data on the 2TB disk - and some of the degraded volumes rebuilt. Then I started upgrading Harvester to v1.1.0, bringing Longhorn up to v1.3.2 - where I can see the `Setting -> Orphaned Data` option and could prune the other orphaned data from the other nodes. There are still some failed volumes. I don't understand why this one cannot be rebuilt from the one healthy copy - it has been saying 27% restored for days.
And while I'm here being annoying - do you have any clue why this volume just keeps cycling through attaching and detaching (see the animated GIF)? It's stopping one VM in Harvester from starting, it's stopping a backup of that VM, and it's stopping a conversion from a volume to an image in Harvester. The logs from the instance manager on the node - which flash up for a short time and then disappear again - say this:
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=info msg="Process Manager: prepare to create process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=debug msg="Process Manager: validate process path: /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.3.2/longhorn dir: /host/var/lib/longhorn/engine-binaries/ image: longhornio-longhorn-engine-v1.3.2 binary: longhorn"
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=info msg="Process Manager: created process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Creating volume /host/var/lib/longhorn/replicas/pvc-af69ea5b-9798-443a-8527-583c5fd35b70-4280b053, size 10737418240/512"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Starting to create disk" disk=000
time="2022-10-29T15:44:41Z" level=info msg="Finished creating disk" disk=000
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Listening on sync agent server 0.0.0.0:10182"
time="2022-10-29T15:44:41Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10180"
time="2022-10-29T15:44:41Z" level=info msg="Listening on data server 0.0.0.0:10181"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:41Z" level=info msg="Listening on sync 0.0.0.0:10182"
[longhorn-instance-manager] time="2022-10-29T15:44:41Z" level=info msg="Process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 has started at localhost:10180"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: prepare to delete process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: deleted process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: trying to stop process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="wait for process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 to shutdown"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: wait for process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 to shutdown before unregistering process"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:46Z" level=warning msg="Received signal interrupt to shutdown"
[pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2] time="2022-10-29T15:44:46Z" level=warning msg="Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="Process Manager: process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2 stopped"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: prepare to delete process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: deleted process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="Process Manager: successfully unregistered process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=info msg="Process Manager: successfully unregistered process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
[longhorn-instance-manager] time="2022-10-29T15:44:46Z" level=debug msg="Process Manager: prepare to delete process pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-4ada2db2"
f
We would need a full support bundle to understand why the volume is stuck in the attach/detach loop. You can generate one using the link at the bottom of the Longhorn UI.
s
Sent in this thread.
f
There are a lot of rebuilding errors like this one:
2022-11-01T07:45:48.921548657Z time="2022-11-01T07:45:48Z" level=warning msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/pvc-af69ea5b-9798-443a-8527-583c5fd35b70-e-bdb26b32 error="failed to sync engine for longhorn-system/pvc-af69ea5b-9798-443a-8527-583c5fd35b70-e-bdb26b32: failed to start rebuild for pvc-af69ea5b-9798-443a-8527-583c5fd35b70-r-2bb6d72d of pvc-af69ea5b-9798-443a-8527-583c5fd35b70-e-bdb26b32: proxyServer=10.52.1.173:8501 destination=10.52.1.173:10001: failed to list replicas for volume: rpc error: code = Unknown desc = failed to list replicas for volume 10.52.1.173:10001: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.52.1.173:10001: connect: connection refused\"" node=harvester002
It seems that the CNI network might have a problem?
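A couple of quick checks that might narrow it down (a rough sketch - the grep targets are taken from the error message above):
# which pod currently owns 10.52.1.173, and on which node is it running?
kubectl -n longhorn-system get pods -o wide | grep 10.52.1.173
# are this volume's replica objects present, and what state are they in?
kubectl -n longhorn-system get replicas.longhorn.io | grep pvc-af69ea5b-9798-443a-8527-583c5fd35b70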
s
Well - there are plenty of other volumes which do work. The Harvester cluster is up. All three nodes are healthy and can communicate. Other VMs are working okay. I might just stop wasting your time, delete the Harvester VM which this volume is connected to, and rebuild it. I hoped that by preserving the VMs and volumes in this state I might be of some assistance to the project :-/