# longhorn-storage
l
Something is consuming disk storage space
Could be snapshots
Can’t see what disk it is … as you haven’t expanded any of the unschedulable nodes
Something is requesting PVCs of a size that makes the allocated space look like that
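If you want to see what's actually requesting that space, listing every PVC with its requested size should narrow it down. Something like this (plain kubectl, nothing Longhorn-specific):

```bash
# List every PVC in the cluster with its requested size and bound volume
kubectl get pvc --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage,STATUS:.status.phase,VOLUME:.spec.volumeName'
```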
i
I got a reply from my colleagues. Here is what they said:
Yeah, something quite wrong is going on with our Longhorn. I reset our cluster completely, which included restarting Longhorn, and that got us into a clean state before any PVCs were even being created (first screenshot).
Then, as soon as the first PVC started getting created, I saw weird behaviour: this PVC is only requesting 10 Gi of space, but it wasn't getting attached straight away and stayed in Detached for some reason (second and third screenshots). That seems odd, because with only one PVC I would have thought it would get bound, attached, and healthy right away, since at that point Longhorn was managing nothing else.
Then, checking the nodes page again, I can see that each node is already allocating double what the PVC is requesting, and Longhorn is already trying to rebalance the replicas (screenshots 4, 5, and 6).
From what I can tell we have no instruction telling Longhorn to create any snapshots, so I don't think snapshots should be taking up any of the allocated space. But one thing I noticed today while looking at this is that all of the disks seem to have the same name (screenshot 7). I'm now suspicious that this is confusing Longhorn about which disk is which, though I'm not sure where that name is even set from...
Sorry for screenshot spam, it just makes it much easier to explain what I'm seeing.
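For what it's worth, the snapshot theory is easy to double-check, since Longhorn keeps its state in CRDs. Something like this should show what it thinks it's managing (assuming the default longhorn-system namespace; exact CRD names can vary between Longhorn versions):

```bash
# Volumes Longhorn knows about, with their spec sizes and state
kubectl -n longhorn-system get volumes.longhorn.io
# Where each replica landed; with the default replica count of 3,
# a single 10 Gi volume allocates 10 Gi on each of three nodes
kubectl -n longhorn-system get replicas.longhorn.io
# Any snapshots that could be consuming allocated space
kubectl -n longhorn-system get snapshots.longhorn.io
```

The default replica count of 3 would explain allocation showing up on every node for a single PVC, though not the doubling.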
I have another follow-up. Our company had a power outage a while ago, and it caused problems with the cluster disks. We ended up just restoring the disks from a working copy. When we run `lvdisplay`, the LVs on nodes 1 and 3 say they were created by node 2, and all of the nodes report the same disk name and LV UUID. I don't really know what kind of issues this could cause, but all 3 nodes have an LV with the same UUID and creation date. Do you think it's possible that's tripping up Longhorn and causing issues, if all of the LVs have the same UUID?
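For anyone wanting to check for the same thing, the LV UUID and creation host are quick to compare across nodes. Roughly this (hostnames are placeholders; the -o field names are from the lvs man page):

```bash
# Identical LV UUIDs across nodes confirm the disks are clones
for node in node1 node2 node3; do
  echo "== $node =="
  ssh "$node" sudo lvs -o lv_name,lv_uuid,lv_host,lv_time
done
```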
Looks like the duplicate UUIDs were what was causing the problem! At least for now, everything is working.
l
Interesting! And good find.
What did you do to update the UUIDs of the LVs?
i
I believe we ended up just deleting them and re-creating them. They were the same because they were copied from a VM template we took. The data we had there wasn't important so we were ok with losing it.
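For anyone who needs to do the same, it was roughly this shape; the VG/LV names, size, and mount point here are placeholders for whatever your template uses, and it obviously wipes the LV:

```bash
# DESTRUCTIVE: destroys the LV's data. Run on each affected node.
umount /var/lib/longhorn               # wherever the cloned LV was mounted
lvremove -y /dev/vg0/longhorn          # drop the cloned LV along with its UUID
lvcreate -y -L 100G -n longhorn vg0    # recreate it; LVM generates a fresh UUID
mkfs.ext4 /dev/vg0/longhorn
mount /dev/vg0/longhorn /var/lib/longhorn
```

If the data had mattered, vgimportclone is the LVM tool intended for cloned disks, though as far as I know it regenerates PV and VG UUIDs rather than LV UUIDs.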
l
Aaah okay - makes sense.
Don’t you have bootstrap automation for setting up the clusters, disk setup and so on?
i
We do, but they thought it would be faster to copy a template VM. We're using RKE2, and a full restore takes us maybe 1-2 hours. I think this team figured they could skip the cluster bootstrapping process if they restarted from a template.
l
Does bootstrapping take more than two hours per node? Wow!
i
Ah sorry, not per node. For the whole cluster. Per node it's maybe 15-30 minutes, just configuring RKE2 and waiting for it to start.
l
hmm ok