# longhorn-storage
l
I have a cluster on v1.5.5 of Longhorn. I did not set the `taintToleration: "...."` key in the Longhorn `defaultSettings` before upgrading to v1.5+. So now, when introducing new worker nodes into this cluster, the instance-manager does NOT come up, and `kubectl describe` on an instance-manager says `TaintToleration failed`. Reading the docs, I understand that changing the tolerations can lead to crashing volumes. Will that also be the case when proper tolerations are already on the longhorn-manager, driver, and UI? So basically I'm hoping to be able to only set the `taintToleration: "...."` key in the Longhorn `defaultSettings`. Is this possible? Thank you.
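For concreteness, what I'm hoping for is roughly this in my Helm values (the taint key/value below is just a made-up example, and I'm assuming the `key=value:Effect` string format, with multiple tolerations separated by `;`, from the settings reference):
```
defaultSettings:
  # hypothetical taint; multiple tolerations would be ";"-separated
  taintToleration: "storage=longhorn:NoExecute"
```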
I tried upgrading Longhorn from v1.5.3 to v1.5.4 and at the same time configuring the `taintToleration: "...."` key in Longhorn `defaultSettings`.
… On this cluster (running Kubernetes v1.29.3), not all instance-managers restarted. There are still a couple of instances left that are many days old. Also, I did NOT see any workloads failing because they lost their volumes. Ideas and comments are highly appreciated. Thank you.
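(For what it's worth, this is roughly how I've been spotting the old instance-manager pods; the label selector is an assumption on my part and may differ between Longhorn versions:)
```
# list instance-manager pods sorted by creation time (oldest first)
kubectl get pods -n longhorn-system \
  -l longhorn.io/component=instance-manager \
  --sort-by=.metadata.creationTimestamp
```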
Anyone?
p
I am unfamiliar with taints/tolerations & Longhorn, btw.
l
Thank you for responding. Longhorn v1.5.5 configured the key like that. But maybe I'm specifying the toleration value itself incorrectly; I'll check.
@salmon-doctor-9726 do you have any insights here … mostly in regard to my initial message? Thank you very much.
Hmm … if I have to detach all volumes to get Longhorn to tolerate some taints, is it all volumes at the same time, or can I “take it” instance manager by instance manager and thereby better control how much downtime/degradation a given workload has? Thank you
s
cc @faint-sunset-36608
🙏 1
f
Hello @late-needle-80860. Can you take a look at my comment on the linked GitHub issue to see if it can help you? https://github.com/longhorn/longhorn/issues/6313#issuecomment-2101102260 If it cannot, can you provide a copy of your Helm values file? It may also be helpful to provide the output of the following command, as it can help us understand how longhorn-manager is responding (if at all) to your settings.
```
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=-1 | grep -i toleration
```
For example, while getting my bearings in the GitHub issue, I found the following log, which helped me understand why my `taint-toleration` setting (as configured incorrectly) wasn't being used.
```
[longhorn-manager-fbc2n] time="2024-05-08T17:26:31Z" level=error msg="Failed to unmarshal customized default settings from yaml data taint-toleration: [map[effect:NoExecute key:test/test operator:Equal value:true]]\npriority-class: longhorn-critical, will give up using them" func=types.getDefaultSettingFromYAML file="setting.go:1502" error="yaml: did not find expected ',' or ']'"
```
> I tried upgrading Longhorn from v1.5.3 to v1.5.4 and at the same time configuring the `taintToleration: "...."` key in Longhorn `defaultSettings`. … On this cluster (running Kubernetes v1.29.3), not all instance-managers restarted. There are still a couple of instances left that are many days old. Also, I did NOT see any workloads failing because they lost their volumes.

This is normal upgrade behavior. Even if engines are live upgraded, we cannot move them to a new instance-manager while they are running. Your old instance-managers will exist until all engines running in them have been stopped (over the normal course of time). It probably indicates you did not specify the `taintToleration` / `taint-toleration` setting in a way that Longhorn recognizes. If you had, it WOULD have killed all your instance-manager pods. We DO NOT recommend applying this setting while volumes are running.
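If you do decide to apply it, a rough way to confirm that every volume really is detached first (just a sketch; the column path is assumed from the Longhorn Volume CRD) would be:
```
# each volume should report "detached" before the toleration setting can sync
kubectl get volumes.longhorn.io -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state
```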
l
hmm thank you @faint-sunset-36608. I configured this:
```
defaultSettings:
  concurrentAutomaticEngineUpgradePerNodeLimit: 3
  createDefaultDiskLabeledNodes: true
  orphanAutoDeletion: true
  priorityClass: system-node-critical
  replicaReplenishmentWaitInterval: 300
  taintToleration: "BeingBootstrapped=true:NoExecute"
```
That's in my Longhorn Helm values file. And yes, I don't see the setting in the Longhorn UI. However, I do see it in the Longhorn settings ConfigMap in the `longhorn-system` namespace. I also tried `taintToleration: "BeingBootstrapped:NoExecute"` as the value, with the same result. The other settings come through, so this is weird. Checking the logs of the longhorn-manager certainly makes me wiser 😄. I see this:
```
time="2024-05-07T18:22:28Z" level=warning msg="Invalid customized default setting taint-toleration with value BeingBootstrapped=true:NoExecute, will continue applying other customized settings" func="datastore.(*DataStore).filterCustomizedDefaultSettings" file="longhorn.go:109" error="failed to set the setting taint-toleration with invalid value BeingBootstrapped=true:NoExecute: current state prevents this: cannot modify toleration setting before all volumes are detached"
```
So, about that message. Does it mean that:
• the toleration value itself is incorrect?
• and, further, that it could not be applied because all volumes have to be detached first?
If the toleration value is declared incorrectly, what's the correct way of declaring it? Thank you very much.
f
You hit these blocks of code: https://github.com/longhorn/longhorn-manager/blob/88074ee747e60d8edcd98072a793c8614d7bff86/datastore/longhorn.go#L108-L111 and https://github.com/longhorn/longhorn-manager/blob/88074ee747e60d8edcd98072a793c8614d7bff86/datastore/longhorn.go#L321-L328. In your version of Longhorn, we refuse to actually apply this setting while there are attached volumes. So it exists in the `longhorn-default-settings` ConfigMap, and occasionally a controller tries to "copy" it to the applied settings (which would appear in the UI), but we log a warning instead. This behavior is changed in https://github.com/longhorn/longhorn/issues/7173 such that the setting will be copied to the applied settings and then applied lazily over time. I think Longhorn has NOT validated the syntax of your setting yet, but it looks correct to me. Unfortunately, the setting will only sync after ALL volumes detach.
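If you want to see where the value is currently sitting, something along these lines should show the difference between the Helm-managed ConfigMap and the applied setting (resource names are assumed from a typical Helm install, so adjust if yours differ):
```
# the ConfigMap holds what Helm wrote; the Setting CR is what Longhorn applied
kubectl get configmap -n longhorn-system | grep -i default-setting
kubectl get settings.longhorn.io taint-toleration -n longhorn-system -o jsonpath='{.value}{"\n"}'
```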
l
Okay, it all makes sense now! Thank you. So if I upgrade to v1.6+ of Longhorn, it can migrate lazily over time … e.g. when an instance manager is restarted for whatever reason, or when new workers are introduced into the cluster and old ones are decommissioned?
f
Yes, it should work as you describe.
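One way to sanity-check it afterwards (again just a sketch; the label selector is an assumption) is to look at the tolerations that new instance-manager pods actually pick up:
```
# print each instance-manager pod name alongside its tolerations
kubectl get pods -n longhorn-system -l longhorn.io/component=instance-manager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.tolerations}{"\n"}{end}'
```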
🙏 1
🦜 1
❤️ 1
l
Thank you.