# longhorn-storage
q
Not that I know of. That's odd. If you don't mind writing a Github issue for that and attaching a support bundle, that would be interesting to dig into.
b
I appreciate the offer but the support bundle has a fair amount of sensitive info in it, such as URLs and VM names. Is it possible to email it to someone directly instead of posting it publicly?
q
Sure, a mail to longhorn-support-bundle@suse.com that mentions the issue number would work.
b
Thank you. I have created https://github.com/longhorn/longhorn/issues/8450 and emailed across the support bundle
l
Are you using Velero or Kasten to trigger Longhorn to do CSI snapshots?
b
No - there is currently no snapshotting. Out of the box harvester and longhorn with no real customisation apart from multiple storage classes (replica 3 and replica 1)
... and the storage network is on a VLAN
l
hmm, did you properly set up the Longhorn disk configuration?
b
We are using the default out-of-the-box longhorn settings in this test env, 25% minimal available storage and 100% overprovisioning. The physical disks still got over-allocated before becoming un-schedulable.
q
In the bundle I notice that there is a Harvester setting, `overcommit-config`, with a value for storage of 200. That may be what is driving the behavior.
b
Ah yes! The defaults for harvester are
{
  "cpu": 1600,
  "memory": 150,
  "storage": 200
}
I guess this is overriding longhorn?
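For context, the storage overcommit percentage roughly caps how much scheduled replica size a disk may carry relative to its capacity, while minimal available storage reserves a free-space floor. A hedged sketch of that kind of check (illustrative Python, not Longhorn's actual scheduler; the function name and exact formula are assumptions based on the settings named in this thread):

```python
def disk_schedulable(storage_maximum, storage_scheduled, storage_available,
                     new_replica_size, over_provisioning_pct, minimal_available_pct):
    """Illustrative sketch (not Longhorn's real code): a new replica fits if
    the total scheduled size stays under the overcommit ceiling and free
    space stays above the reserved minimal-available floor."""
    overcommit_ceiling = storage_available * over_provisioning_pct / 100
    reserved_floor = storage_maximum * minimal_available_pct / 100
    fits_ceiling = storage_scheduled + new_replica_size <= overcommit_ceiling
    keeps_floor = storage_available - new_replica_size > reserved_floor
    return fits_ceiling and keeps_floor

GiB = 1 << 30
# 1 TiB disk, thread defaults: 25% minimal available, 100% overprovisioning
print(disk_schedulable(1024 * GiB, 0, 1024 * GiB, 300 * GiB, 100, 25))          # True
print(disk_schedulable(1024 * GiB, 900 * GiB, 1024 * GiB, 200 * GiB, 100, 25))  # False
```

With storage overcommit at 200 instead of 100 the ceiling doubles, which would match the over-allocation seen here.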
q
It appears to be. I'm not exactly sure of the precedence. You could (a) experiment by changing the setting (no idea what happens if you pick a setting that is already exceeded) or (b) ask in the harvester channel, perhaps. Not washing my hands of it, but still poking around on this side.
b
I have already used disk eviction in longhorn to rebalance over provisioned volumes but it's good to know what is causing this. I will ask in the harvester channel to see if people can explain the precedence of items. Thank you for finding that setting πŸ™‚
b
Yes, Harvester unconditionally overwrites Longhorn's overcommit with the value from Harvester's overcommit settings (storage)
πŸ‘ 2
b
Thinking about this more, is this a bug? The disks become un-schedulable in longhorn because its overcommit value differs from Harvester's. Is harvester somehow ignoring the un-schedulable taint?
And the warning showing they are not schedulable is propagated back to the harvester UI
f
Hey @bored-painting-68221, could you take a look at the support bundle for this one when you get a second, specifically focusing on why the `over-commit` setting doesn't appear to be propagating to the `storage-over-provisioning-setting`?
b
Sure! Would you send it to me over SUSE's private Slack? Harvester does not seem to account for the possibility of the Longhorn setting being changed outside of Harvester's management.
In fact, the stale value will probably survive any controller syncs or Harvester pod restarts, because Harvester writes a checksum of the setting's current value to an annotation. Harvester sees the checksum matches the annotation, does not consult Longhorn, and considers its job done (nothing to update)
βœ… 1
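That checksum short-circuit can be sketched like this (the annotation key `harvesterhci.io/hash` is mentioned elsewhere in this thread; the function name and hashing details are assumptions, not Harvester's actual code):

```python
import hashlib
import json

HASH_ANNOTATION = "harvesterhci.io/hash"  # annotation key named in this thread

def sync_overcommit(setting_value: dict, annotations: dict) -> bool:
    """Sketch of the skip logic described above: if the checksum of the
    desired value already matches the stored annotation, the controller
    assumes Longhorn is in sync and never reads Longhorn's setting back."""
    checksum = hashlib.sha256(
        json.dumps(setting_value, sort_keys=True).encode()).hexdigest()
    if annotations.get(HASH_ANNOTATION) == checksum:
        return False  # nothing to do -- even if Longhorn changed out-of-band
    annotations[HASH_ANNOTATION] = checksum
    # ...a real controller would write the value through to Longhorn here...
    return True

ann = {}
print(sync_overcommit({"storage": 200}, ann))  # True: first sync writes through
print(sync_overcommit({"storage": 200}, ann))  # False: hash matches, Longhorn never consulted
```

This is why an out-of-band change to the Longhorn side would stick: nothing ever invalidates the annotation.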
I'm guessing this support bundle was taken via the Longhorn dashboard? It is missing the harvester-system namespace for pod logs. You can see if it's due to the behavior I outlined above by removing the `harvesterhci.io/hash` annotation from the `overcommit-config` setting.harvesterhci.io object
f
Sorry @bored-painting-68221. I had my e-mail open to forward it to you and never did... Sending it now.
b
Hmm, actually, playing around with this on a Harvester cluster, it might not work at all
Yeah... looks like the setting.harvesterhci.io object has a .default field which Harvester doesn't propagate. After setting the value with kubectl or the UI it seems to propagate to Longhorn, since Harvester reads the .value field and not the .default field.
Which I guess brings us full circle back to how it got overprovisioned at all, if Harvester's default storage overcommit never propagated to Longhorn in the first place lol (unless perhaps it was manually set once before?)
q
Nope, check the defect. Harvester wanted to tell Longhorn to overcommit and didn't, but then Longhorn did anyway, apparently through an interesting bug in competitive scheduling of replicas.
βœ… 1
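The "competitive scheduling of replicas" mentioned here reads like a classic check-then-act race. A deterministic toy illustration (pure Python, nothing Longhorn-specific): two schedulers both see enough headroom before either reservation is recorded, so the disk ends up past its ceiling.

```python
import threading

class Disk:
    def __init__(self, ceiling):
        self.ceiling = ceiling
        self.scheduled = 0

def schedule_replica(disk, size, barrier):
    # Both schedulers read capacity first...
    fits = disk.scheduled + size <= disk.ceiling
    barrier.wait()  # force the interleaving: both checks finish before either write
    if fits:
        disk.scheduled += size  # ...then both commit, blowing past the ceiling

disk = Disk(ceiling=100)
barrier = threading.Barrier(2)
threads = [threading.Thread(target=schedule_replica, args=(disk, 60, barrier))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(disk.scheduled)  # 120 -- over the 100 ceiling, though each check passed
```

The fix for this class of bug is to make the check and the reservation one atomic step (e.g. under a lock), rather than two separate reads and writes.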
f
The support bundle shows the Longhorn setting was `100` well before the overprovisioning, so we think we need to fix a Longhorn issue as @quick-river-12881 described. Just wanted to make sure we understood why it was `100` and not the expected `200`.
b
Where did you see Harvester's intent to proxy the setting to Longhorn? The Harvester controller seems to ignore the overcommit-config object when it's just the default:
  default: '{"cpu":1600,"memory":150,"storage":200}'
and it won't call Longhorn until it has an explicit
  value: {"storage":200}
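A minimal sketch of that precedence (field names from the setting.harvesterhci.io object quoted above; the function itself is hypothetical, not Harvester code):

```python
import json

def effective_overcommit(setting: dict):
    """Sketch of the behavior described above: only an explicitly set
    .value is propagated to Longhorn; the .default is never pushed."""
    if setting.get("value"):
        return json.loads(setting["value"])  # propagated to Longhorn
    return None  # default ignored; Longhorn keeps whatever it already had

default_only = {"default": '{"cpu":1600,"memory":150,"storage":200}'}
explicit = dict(default_only, value='{"storage":200}')
print(effective_overcommit(default_only))  # None -> Longhorn stays at 100
print(effective_overcommit(explicit))      # {'storage': 200}
```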
q
Sorry, "intent" was meant conceptually, not that there was any evidence of an actual attempt.
βœ… 1
b
Hey - posted in the issue as well but we are consistently hitting this bug and struggling to balance our storage. Do you have any tips for things we can try? With eviction we find longhorn is immediately over-committing the disk we just evicted as soon as we make it schedulable again.