# longhorn-storage
q
Not that I know of. That's odd. If you don't mind writing a Github issue for that and attaching a support bundle, that would be interesting to dig into.
b
I appreciate the offer but the support bundle has a fair amount of sensitive info in it, such as URLs and VM names. Is it possible to email it to someone directly instead of posting it publicly?
q
Sure, a mail to longhorn-support-bundle@suse.com that mentions the issue number would work.
b
Thank you. I have created https://github.com/longhorn/longhorn/issues/8450 and emailed across the support bundle
l
Are you using Velero or Kasten to trigger Longhorn to do CSI snapshots?
b
No - there is currently no snapshotting. Out of the box harvester and longhorn with no real customisation apart from multiple storage classes (replica 3 and replica 1)
... and the storage network is on a VLAN
l
hmm, did you properly set up the Longhorn disk configuration?
b
We are using the default out-of-the-box longhorn settings in this test env, 25% minimal available storage and 100% overprovisioning. The physical disks still got over-allocated before becoming un-schedulable.
q
In the bundle I notice that there is a Harvester setting, `overcommit-config`, with a value for storage of 200. That may be what is driving the behavior.
b
Ah yes! The defaults for harvester are
{
  "cpu": 1600,
  "memory": 150,
  "storage": 200
}
I guess this is overriding longhorn?
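For context, the storage overcommit percentage roughly caps how much scheduled replica size a disk may carry relative to its capacity, while minimal available storage reserves a free-space floor. A hedged sketch of that kind of check (illustrative Python, not Longhorn's actual scheduler; the function name and exact formula are assumptions based on the settings named in this thread):

```python
def disk_schedulable(storage_maximum, storage_scheduled, storage_available,
                     new_replica_size, over_provisioning_pct, minimal_available_pct):
    """Illustrative sketch (not Longhorn's real code): a new replica fits if
    the total scheduled size stays under the overcommit ceiling and free
    space stays above the reserved minimal-available floor."""
    overcommit_ceiling = storage_available * over_provisioning_pct / 100
    reserved_floor = storage_maximum * minimal_available_pct / 100
    fits_ceiling = storage_scheduled + new_replica_size <= overcommit_ceiling
    keeps_floor = storage_available - new_replica_size > reserved_floor
    return fits_ceiling and keeps_floor

GiB = 1 << 30
# 1 TiB disk, thread defaults: 25% minimal available, 100% overprovisioning
print(disk_schedulable(1024 * GiB, 0, 1024 * GiB, 300 * GiB, 100, 25))          # True
print(disk_schedulable(1024 * GiB, 900 * GiB, 1024 * GiB, 200 * GiB, 100, 25))  # False
```

With storage overcommit at 200 instead of 100 the ceiling doubles, which would match the over-allocation seen here.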
q
It appears to be. I'm not exactly sure of the precedence. You could (a) experiment by changing the setting (no idea what happens if you pick a setting that is already exceeded) or (b) ask in the harvester channel, perhaps. Not washing my hands of it, but still poking around on this side.
b
I have already used disk eviction in longhorn to rebalance over provisioned volumes but it's good to know what is causing this. I will ask in the harvester channel to see if people can explain the precedence of items. Thank you for finding that setting πŸ™‚
b
Yes, Harvester unconditionally overwrites Longhorn's overcommit with the value from Harvester's overcommit settings (storage)
πŸ‘ 2
b
Thinking about this more, is this a bug? The disks become un-schedulable in longhorn because its overcommit value differs from Harvester's. Is harvester somehow ignoring the un-schedulable taint?
And the warning showing they are not schedulable is propagated back to the harvester UI
f
Hey @bored-painting-68221, could you take a look at the support bundle for this one when you get a second, specifically focusing on why the `over-commit` setting doesn't appear to be propagating to the `storage-over-provisioning-setting`?
b
Sure! Would you send it to me over SUSE's private Slack? Harvester does not seem to account for the possibility of the Longhorn setting being changed outside of Harvester's management.
In fact, the stale value will probably survive any controller syncs or Harvester pod restarts, because Harvester writes a checksum of the setting's current value to an annotation. Harvester sees the checksum matches the annotation, does not consult Longhorn, and considers its job done (nothing to update)
βœ… 1
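That checksum short-circuit can be sketched like this (the annotation key `harvesterhci.io/hash` is mentioned elsewhere in this thread; the function name and hashing details are assumptions, not Harvester's actual code):

```python
import hashlib
import json

HASH_ANNOTATION = "harvesterhci.io/hash"  # annotation key named in this thread

def sync_overcommit(setting_value: dict, annotations: dict) -> bool:
    """Sketch of the skip logic described above: if the checksum of the
    desired value already matches the stored annotation, the controller
    assumes Longhorn is in sync and never reads Longhorn's setting back."""
    checksum = hashlib.sha256(
        json.dumps(setting_value, sort_keys=True).encode()).hexdigest()
    if annotations.get(HASH_ANNOTATION) == checksum:
        return False  # nothing to do -- even if Longhorn changed out-of-band
    annotations[HASH_ANNOTATION] = checksum
    # ...a real controller would write the value through to Longhorn here...
    return True

ann = {}
print(sync_overcommit({"storage": 200}, ann))  # True: first sync writes through
print(sync_overcommit({"storage": 200}, ann))  # False: hash matches, Longhorn never consulted
```

This is why an out-of-band change to the Longhorn side would stick: nothing ever invalidates the annotation.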
I'm guessing this support bundle was taken via the Longhorn dashboard? It is missing the harvester-system namespace for pod logs. You can see if it's due to the behavior I outlined above by removing the `harvesterhci.io/hash` annotation from the `overcommit-config` setting.harvesterhci.io object
f
Sorry @bored-painting-68221. I had my e-mail open to forward it to you and never did... Sending it now.
b
Hmm, actually, playing around with this on a Harvester cluster, it might not work at all
Yeah... looks like the setting.harvesterhci.io object has a .default field which Harvester doesn't propagate. After setting the value with kubectl or the UI it seems to propagate to Longhorn, since Harvester reads the .value field and not the .default field.
Which I guess brings us full circle back to how it got overprovisioned at all, if Harvester's default storage overcommit never propagated to Longhorn in the first place lol (unless perhaps it was manually set once before?)
q
Nope, check the defect. Harvester wanted to tell Longhorn to overcommit and didn't, but then Longhorn did anyway, apparently through an interesting bug in competitive scheduling of replicas.
βœ… 1
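The "competitive scheduling of replicas" mentioned here reads like a classic check-then-act race. A deterministic toy illustration (pure Python, nothing Longhorn-specific): two schedulers both see enough headroom before either reservation is recorded, so the disk ends up past its ceiling.

```python
import threading

class Disk:
    def __init__(self, ceiling):
        self.ceiling = ceiling
        self.scheduled = 0

def schedule_replica(disk, size, barrier):
    # Both schedulers read capacity first...
    fits = disk.scheduled + size <= disk.ceiling
    barrier.wait()  # force the interleaving: both checks finish before either write
    if fits:
        disk.scheduled += size  # ...then both commit, blowing past the ceiling

disk = Disk(ceiling=100)
barrier = threading.Barrier(2)
threads = [threading.Thread(target=schedule_replica, args=(disk, 60, barrier))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(disk.scheduled)  # 120 -- over the 100 ceiling, though each check passed
```

The fix for this class of bug is to make the check and the reservation one atomic step (e.g. under a lock), rather than two separate reads and writes.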
f
The support bundle shows the Longhorn setting was `100` well before the overprovisioning, so we think we need to fix a Longhorn issue as @quick-river-12881 described. Just wanted to make sure we understood why it was `100` and not the expected `200`.
b
Where did you see Harvester's intent to proxy the setting to Longhorn? The Harvester controller seems to ignore the overcommit-config object when it's just the default:
  default: '{"cpu":1600,"memory":150,"storage":200}'
and it won't call Longhorn until it has an explicit
  value: {"storage":200}
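A minimal sketch of that precedence (field names from the setting.harvesterhci.io object quoted above; the function itself is hypothetical, not Harvester code):

```python
import json

def effective_overcommit(setting: dict):
    """Sketch of the behavior described above: only an explicitly set
    .value is propagated to Longhorn; the .default is never pushed."""
    if setting.get("value"):
        return json.loads(setting["value"])  # propagated to Longhorn
    return None  # default ignored; Longhorn keeps whatever it already had

default_only = {"default": '{"cpu":1600,"memory":150,"storage":200}'}
explicit = dict(default_only, value='{"storage":200}')
print(effective_overcommit(default_only))  # None -> Longhorn stays at 100
print(effective_overcommit(explicit))      # {'storage': 200}
```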
q
Sorry, "intent" was meant conceptually, not that there was any evidence of an actual attempt.
βœ… 1
b
Hey - posted in the issue as well but we are consistently hitting this bug and struggling to balance our storage. Do you have any tips for things we can try? With eviction we find longhorn is immediately over-committing the disk we just evicted as soon as we make it schedulable again.