# longhorn-storage
l
Hi, we're having some Longhorn slowness that's been really difficult to track down, and I'm wondering if anyone has any insight or suggestions on what to look for. Longhorn seems OK, all volumes are healthy, and there's nothing in a failed state. We have a lot of volumes (around 700), but it's a 15-node cluster, so we don't think we're overloading anything. If I create a PVC, it's provisioned immediately, so no issue there. When I launch a pod, though, it can take a really long time (several minutes) for the PVC to attach and the pod to get to a running state. There's no error, it just takes a really long time, which is unexpected. The only message I found that seems fishy is in the longhorn-manager pod, which has something like:
```
time="2025-02-28T11:07:01Z" level=warning msg="Cannot auto-balance volume in unknown state" func="controller.(*VolumeController).getReplicaCountForAutoBalanceBestEffort" file="volume_controller.go:2416"
```
Eventually it goes through and the volume mounts, so I don't know what's up. Anyone have any ideas? Thanks.
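In case it helps narrow down where the delay sits, something like this should show whether the attach itself is what's hanging while the pod is stuck (the pod/PV/namespace names here are just placeholders):
```
# While the pod is stuck in ContainerCreating, watch its events
kubectl -n <namespace> describe pod <pod-name> | tail -20

# Check whether a VolumeAttachment for the PV exists and is marked attached
kubectl get volumeattachments | grep <pv-name>
```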
I'm wondering if increasing the number of csi-attacher pods from the default of 3 would help at all. I can't find any documentation that recommends doing that on larger deployments though.
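If anyone's curious, this is roughly what I'd try; the Helm value name is from memory of the chart (and the release/repo names are placeholders), so double-check against your values.yaml:
```
# Check the current csi-attacher replica count (defaults to 3)
kubectl -n longhorn-system get deploy csi-attacher

# If Longhorn was installed via Helm, bump it through the chart rather than
# scaling the deployment directly (longhorn-driver-deployer may reconcile it back)
helm upgrade longhorn longhorn/longhorn -n longhorn-system \
  --reuse-values --set csi.attacherReplicaCount=5
```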
b
Are all of your nodes also Longhorn nodes? Do they all basically have the same amount of data?
l
Some nodes have more storage than others, but all of them (except one) are Longhorn nodes. The replicas are spread around; a few nodes have more replicas than others, but overall the distribution is fairly even.
It actually really feels like Longhorn in general is super slow, and CPU usage is high. I haven't been able to track down what it's doing that's creating such a load.
b
My first thought was that you had bin packing or something and it's trying to place PVCs on nodes whose storage is full.
how big are the volumes?
l
I just reviewed the node list and nothing seems to be full or even over-allocated. The volumes are various sizes; I experienced the slowness with a 10Gi volume with nothing on it (just a new allocation).
b
Are they all NVMe? Are they single disks or RAIDed?
l
They are single disks, NVMe, and added individually to each node in the Longhorn configuration.
b
hm
l
Just logging into one of the nodes, I don't see any I/O wait in top, but Longhorn is using 32 CPUs.
b
If you have a lot of attachments, that might not be unusual
Since it'd be trying to sync replicas
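Something like this should show which Longhorn pods are actually burning the CPU and roughly how many volumes are attached (needs metrics-server; the grep assumes the default printer columns on the volume CRD):
```
# Which Longhorn pods are using the most CPU (requires metrics-server)
kubectl top pod -n longhorn-system --sort-by=cpu | head -20

# Rough count of currently attached volumes (-w so "detached" doesn't match)
kubectl -n longhorn-system get volumes.longhorn.io --no-headers | grep -cw attached
```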
l
Ok, we definitely do have that.
b
Maybe throughput on the backend network that longhorn should be using?
l
That could be... but it might be tough for us to track, since it's a busy cluster and there's already a lot of inter-node traffic.
they're all connected via 100Gb
b
Maybe have the network folks check the switch for any packet loss?
l
That is something we can definitely have them do.
b
I know some switches keep track of that metric.
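On the node side, the NIC counters should show drops/errors too; eth0 here is just a stand-in for whatever interface the storage traffic actually uses:
```
# Per-interface drop/error counters
ip -s link show eth0

# NIC-level stats, filtered for anything that looks like loss
ethtool -S eth0 | grep -iE 'drop|discard|err'
```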
l
I'm not sure how many replicas each volume has, but I can try to find out. Maybe reducing the replica count from 3 down to 2 would help, if they're at 3.
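Something like this should list the configured replica count per volume, assuming the default longhorn-system namespace and that the volume CRD exposes spec.numberOfReplicas (worth verifying with kubectl explain):
```
# Configured replica count and state for every Longhorn volume
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.numberOfReplicas,STATE:.status.state
```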
b
Outside of that, dmesg might give some clues, but I'm not sure.
By default it's normally 3. You can also check to make sure there aren't backup jobs going on at the same time.
That might slow things down.
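Something along these lines should show what recurring jobs exist and whether any backups are in flight right now; the status field names are from memory of the CRDs, so check them with kubectl explain if they come back empty:
```
# Recurring snapshot/backup jobs and their schedules
kubectl -n longhorn-system get recurringjobs.longhorn.io

# Backups currently in progress
kubectl -n longhorn-system get backups.longhorn.io \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,SNAPSHOT:.status.snapshotName
```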
l
Yeah, the backups have been a known issue for a while (they generate a lot of traffic); we'd have to try to confirm that they're running while we see the issues... they should be done by the end of the day, but they might be taking much longer than expected.
We generally have either 2 or 3 replicas of all volumes (depending on the storageclass that was used), but we also have 700 volumes
b
yeah, I have fewer nodes, but more volumes than that in our production.
Typically ours are ~ 4GiB
But I think we have somewhere around 1500-2000 volumes.
but they're not all getting hammered at once.
I think typically around 100-500 are in use at a time
But that's with 6 nodes.
l
OK, from what I can tell we have 688 that might be active, so it's not an insane amount over what could be expected, and we have about 15 nodes in the Longhorn cluster. Ours definitely have a variety of sizes and a variety of workloads, so there's no real consistency there.
b
Are you using the Longhorn UI?
The dashboard normally pulls events that are pretty useful.
l
Yeah, the UI is super slow though too
Lots of replicas being started/stopped and snapshots being removed (I'm guessing this is backup activity), or at least I hope it is.
b
More than likely.
You might need to adjust the cron times, as I don't think it syncs to the node's TZ.
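A quick way to see the schedules and shift them, assuming the cron is evaluated in the longhorn-manager pods' timezone (typically UTC) rather than the node's:
```
# Cron schedules for the recurring jobs
kubectl -n longhorn-system get recurringjobs.longhorn.io \
  -o custom-columns=NAME:.metadata.name,TASK:.spec.task,CRON:.spec.cron

# Move one out of business hours (<job-name> is a placeholder)
kubectl -n longhorn-system edit recurringjobs.longhorn.io <job-name>
```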
l
Hmm... I see several of these:
```
Removing unknown replica <tcp://10.42.62.1:11585> in mode ERR from engine
```
but I'm wondering if that's still part of the process, because it all seems to recover and nothing is moving to a degraded state.
b
watch kubectl -n longhorn-system events
if the UI is too slow
If you track those down to a node, it could be a bad disk
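The replica address in that log line should be an instance-manager pod IP, so something like this ought to map it back to a node (and give you a time-sorted event view if the watch is too noisy):
```
# Which pod/node owns the IP from the "unknown replica" message
kubectl -n longhorn-system get pods -o wide | grep 10.42.62.1

# Recent events, oldest first
kubectl -n longhorn-system get events --sort-by=.lastTimestamp | tail -30
```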
l
It doesn't look like we're being flooded with messages, so that's good. There's a good time gap between them.
Thanks for your help, you've given me a bunch to look into, hopefully we can get to the bottom of it.
b
At least it's a starting place. 🙂
l
yeah, I don't imagine this will be a quick fix at all.