# longhorn-storage
l
Hi, we're having some Longhorn slowness that's been really difficult to track down, and I'm wondering if anyone has any insight or suggestions on what to look for. Longhorn seems OK, all volumes are healthy, and there's nothing in a failed state. We have a lot of volumes (around 700), but it's a 15-node cluster, so we don't think we're overloading anything. If I create a PVC, it's provisioned immediately, so no issue there. When I launch a pod, though, it can take a really long time (several minutes) for the PVC to attach and the pod to get to a running state. There's no error, it just takes a really long time, which is unexpected. The only message I found that seems fishy is in the longhorn-manager pod, which has something like:
```
time="2025-02-28T11:07:01Z" level=warning msg="Cannot auto-balance volume in unknown state" func="controller.(*VolumeController).getReplicaCountForAutoBalanceBestEffort" file="volume_controller.go:2416"
```
Eventually it goes through and the volume mounts, so I don't know what's up. Anyone have any ideas? Thanks.
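In case it helps narrow down where the delay sits, something like this should show whether the attach itself is what's hanging while the pod is stuck (the pod/PV/namespace names here are just placeholders):
```
# While the pod is stuck in ContainerCreating, watch its events
kubectl -n <namespace> describe pod <pod-name> | tail -20

# Check whether a VolumeAttachment for the PV exists and is marked attached
kubectl get volumeattachments | grep <pv-name>
```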
I'm wondering if increasing the number of csi-attacher pods from the default of 3 would help at all. I can't find any documentation that recommends doing that on larger deployments though.
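If anyone's curious, this is roughly what I'd try; the Helm value name is from memory of the chart (and the release/repo names are placeholders), so double-check against your values.yaml:
```
# Check the current csi-attacher replica count (defaults to 3)
kubectl -n longhorn-system get deploy csi-attacher

# If Longhorn was installed via Helm, bump it through the chart rather than
# scaling the deployment directly (longhorn-driver-deployer may reconcile it back)
helm upgrade longhorn longhorn/longhorn -n longhorn-system \
  --reuse-values --set csi.attacherReplicaCount=5
```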
b
Are all of your nodes also Longhorn nodes? Do they all basically have the same amount of data?
l
Some nodes have more storage than others, but all of them (except one) are Longhorn nodes. The replicas are spread around; a few nodes have more replicas than others, but overall the distribution is fairly even.
It actually really feels like Longhorn in general is super slow, and CPU usage is high. I haven't been able to track down what it's doing that's creating such a load.
b
My first thought was that you had bin packing or something and it's trying to place PVCs on nodes whose storage is full.
how big are the volumes?
l
I just reviewed the node list and nothing seems to be full or even over-allocated. The volumes are various sizes; I experienced the slowness with a 10Gi volume with nothing on it (just a new allocation).
b
Are they all NVMe? Are they single disks or RAIDed?
l
They are single disks, NVMe, and added individually to each node in the Longhorn configuration.
b
hm
l
Just logging into one of the nodes, I don't see any I/O wait in top, but Longhorn is using 32 CPUs.
b
If you have a lot of attachments, that might not be unusual
Since it'd be trying to sync replicas
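Something like this should show which Longhorn pods are actually burning the CPU and roughly how many volumes are attached (needs metrics-server; the grep assumes the default printer columns on the volume CRD):
```
# Which Longhorn pods are using the most CPU (requires metrics-server)
kubectl top pod -n longhorn-system --sort-by=cpu | head -20

# Rough count of currently attached volumes (-w so "detached" doesn't match)
kubectl -n longhorn-system get volumes.longhorn.io --no-headers | grep -cw attached
```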
l
Ok, we definitely do have that.
b
Maybe throughput on the backend network that longhorn should be using?
l
That could be... but it might be tough for us to track, since it's a busy cluster and there's already a lot of inter-node traffic.
they're all connected via 100Gb
b
Maybe have the network folks check the switch for any packet loss?
l
That is something we can definitely have them do.
b
I know some switches keep track of that metric.
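On the node side, the NIC counters should show drops/errors too; eth0 here is just a stand-in for whatever interface the storage traffic actually uses:
```
# Per-interface drop/error counters
ip -s link show eth0

# NIC-level stats, filtered for anything that looks like loss
ethtool -S eth0 | grep -iE 'drop|discard|err'
```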
l
I'm not sure how many replicas each volume has, but I can try to find out. Maybe reducing the replica count from 3 down to 2 would help, if they're at 3.
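Something like this should list the configured replica count per volume, assuming the default longhorn-system namespace and that the volume CRD exposes spec.numberOfReplicas (worth verifying with kubectl explain):
```
# Configured replica count and state for every Longhorn volume
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.numberOfReplicas,STATE:.status.state
```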
b
Outside of that, dmesg might give some clues, but I'm not sure.
By default it's normally 3. You can also check to make sure there aren't backup jobs going on at the same time.
That might slow things down.
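Something along these lines should show what recurring jobs exist and whether any backups are in flight right now; the status field names are from memory of the CRDs, so check them with kubectl explain if they come back empty:
```
# Recurring snapshot/backup jobs and their schedules
kubectl -n longhorn-system get recurringjobs.longhorn.io

# Backups currently in progress
kubectl -n longhorn-system get backups.longhorn.io \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,SNAPSHOT:.status.snapshotName
```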
l
Yeah, the backups have been a known issue for a while (they generate a lot of traffic); we'd have to try to confirm that they're running while we see the issues... they should be done by the end of the day, but they might be taking much longer than expected.
We generally have either 2 or 3 replicas of all volumes (depending on the storageclass that was used), but we also have 700 volumes
b
yeah, I have fewer nodes, but more volumes than that in our production.
Typically ours are ~ 4GiB
But I think we have somewhere around 1500-2000 volumes.
but they're not all getting hammered at once.
I think typically around 100-500 are in use at a time
But that's with 6 nodes.
l
OK, from what I can tell we have 688 that might be active, so it's not an insane amount over what could be expected, and we have about 15 nodes in the Longhorn cluster. Ours definitely have a variety of sizes and a variety of workloads, so there's no real consistency there.
b
Are you using the Longhorn UI?
The dashboard normally pulls events that are pretty useful.
l
Yeah, the UI is super slow though too
Lots of replicas being started/stopped and snapshots being removed (I'm guessing this is backup activity), or at least I hope it is.
b
More than likely.
You might need to adjust the cron times, as I don't think it syncs to the node's TZ.
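A quick way to see the schedules and shift them, assuming the cron is evaluated in the longhorn-manager pods' timezone (typically UTC) rather than the node's:
```
# Cron schedules for the recurring jobs
kubectl -n longhorn-system get recurringjobs.longhorn.io \
  -o custom-columns=NAME:.metadata.name,TASK:.spec.task,CRON:.spec.cron

# Move one out of business hours (<job-name> is a placeholder)
kubectl -n longhorn-system edit recurringjobs.longhorn.io <job-name>
```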
l
Hmm... I see several of these:
```
Removing unknown replica <tcp://10.42.62.1:11585> in mode ERR from engine
```
but I'm wondering if that's still part of the process, because it all seems to recover and nothing is moving to a degraded state.
b
watch kubectl -n longhorn-system events
if the UI is too slow
If you track those down to a node, it could be a bad disk
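The replica address in that log line should be an instance-manager pod IP, so something like this ought to map it back to a node (and give you a time-sorted event view if the watch is too noisy):
```
# Which pod/node owns the IP from the "unknown replica" message
kubectl -n longhorn-system get pods -o wide | grep 10.42.62.1

# Recent events, oldest first
kubectl -n longhorn-system get events --sort-by=.lastTimestamp | tail -30
```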
l
It doesn't look like we're being flooded with messages, so that's good. There's a good time gap between them.
Thanks for your help, you've given me a bunch to look into, hopefully we can get to the bottom of it.
b
At least it's a starting place. 🙂
l
yeah, I don't imagine this will be a quick fix at all.