late-needle-80860
05/06/2024, 8:07 PM
taintToleration: "...." key in the Longhorn defaultSettings before upgrading to v1.5+. So now, when introducing new worker nodes in this cluster, the instance-manager does NOT come up. The describe output of an instance-manager says: TaintToleration failed.
Reading the docs, I understand that changing the tolerations can lead to crashing volumes. Will this also be the case when proper tolerations are already on the longhorn-manager, driver and UI? Basically, I'm hoping to only have to set the taintToleration: "...." key in Longhorn defaultSettings. Is this possible? Thank you.
late-needle-80860
05/06/2024, 8:32 PM
taintToleration: "...." key in Longhorn defaultSettings.
… On this cluster ( running v1.29.3 Kubernetes ) - not all instance-managers restarted. There’s still a couple of instances left many many days old. Also, I did NOT see any workload failing because they lost their volumes.
Ideas and comments are highly appreciated. Thank you.
late-needle-80860
05/07/2024, 8:05 AM
powerful-librarian-10572
05/07/2024, 8:08 AM
powerful-librarian-10572
05/07/2024, 8:08 AM
late-needle-80860
05/07/2024, 8:12 AM
late-needle-80860
05/07/2024, 3:39 PM
late-needle-80860
05/07/2024, 7:42 PM
salmon-doctor-9726
05/08/2024, 4:29 AM
faint-sunset-36608
05/08/2024, 5:56 PM
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=-1 | grep -i toleration
For example, while getting my bearings in the GitHub issue, I found the following log, which helped me understand why my taint-toleration setting (as configured incorrectly) wasn't being used.
[longhorn-manager-fbc2n] time="2024-05-08T17:26:31Z" level=error msg="Failed to unmarshal customized default settings from yaml data taint-toleration: [map[effect:NoExecute key:test/test operator:Equal value:true]]\npriority-class: longhorn-critical, will give up using them" func=types.getDefaultSettingFromYAML file="setting.go:1502" error="yaml: did not find expected ',' or ']'"
faint-sunset-36608
05/08/2024, 6:02 PM
taintToleration: "...." key in Longhorn defaultSettings.
… On this cluster ( running v1.29.3 Kubernetes ) - not all instance-managers restarted. There’s still a couple of instances left many many days old. Also, I did NOT see any workload failing because they lost their volumes.
This is normal upgrade behavior. Even if engines are live upgraded, we cannot move them to a new instance-manager while they are running. Your old instance-managers will exist until all engines running in them have been stopped (over the normal course of time).
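A minimal sketch for spotting the old instance-managers described above. The helper name list_instance_managers is hypothetical, and it assumes the default `kubectl get pods` column layout, where the pod name is the first column and AGE is the last.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name and AGE of every pod whose name starts with
# "instance-manager". Assumes AGE is the last column of the default output.
list_instance_managers() {
  awk '$1 ~ /^instance-manager/ { print $1, $NF }'
}

# Usage (needs cluster access):
#   kubectl get pods -n longhorn-system --no-headers | list_instance_managers
```

Pods that are much older than the upgrade are most likely the ones still hosting running engines.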
It probably indicates you did not specify the taintToleration / taint-toleration setting in a way that Longhorn recognizes. If you did, it WOULD have killed all your instance-manager pods. We DO NOT recommend applying this setting while volumes are running.
late-needle-80860
05/08/2024, 7:44 PM
defaultSettings:
  concurrentAutomaticEngineUpgradePerNodeLimit: 3
  createDefaultDiskLabeledNodes: true
  orphanAutoDeletion: true
  priorityClass: system-node-critical
  replicaReplenishmentWaitInterval: 300
  taintToleration: "BeingBootstrapped=true:NoExecute"
In my Longhorn Helm values file. And yes, I don't see the setting in the Longhorn UI. However, I do see it in the Longhorn settings ConfigMap in the longhorn-system namespace.
I also tried: taintToleration: "BeingBootstrapped:NoExecute" as the value. Same result. The other settings come through. So this is weird.
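A rough sketch of a shape check for the two values tried above. It assumes the setting takes semicolon-separated entries of the form key=value:Effect or key:Effect, which is only an approximation of the documented format and not Longhorn's own validation; the function name looks_like_toleration is hypothetical.

```shell
# Hypothetical helper: returns 0 if every semicolon-separated entry in $1
# looks like key=value:Effect or key:Effect, and 1 otherwise.
# This approximates the documented shape; it is NOT Longhorn's validator.
looks_like_toleration() {
  if printf '%s\n' "$1" | tr ';' '\n' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
    grep -Evq '^[A-Za-z0-9._/-]+(=[A-Za-z0-9._-]*)?:(NoSchedule|NoExecute|PreferNoSchedule)$'
  then
    return 1  # at least one entry did not match the expected shape
  fi
  return 0
}

# Usage:
#   looks_like_toleration "BeingBootstrapped=true:NoExecute" && echo "shape ok"
```

An empty string or a trailing semicolon is reported as malformed by this sketch, because it leaves an empty entry behind.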
Checking the logs of the longhorn-manager certainly makes me wiser 😄. I see this:
time="2024-05-07T18:22:28Z" level=warning msg="Invalid customized default setting taint-toleration with value BeingBootstrapped=true:NoExecute, will continue applying other customized settings" func="datastore.(*DataStore).filterCustomizedDefaultSettings" file="longhorn.go:109" error="failed to set the setting taint-toleration with invalid value BeingBootstrapped=true:NoExecute: current state prevents this: cannot modify toleration setting before all volumes are detached"
So, does the message mean that:
• the toleration value itself is incorrect?
• and, further, that it could not be applied because all volumes have to be detached?
If the toleration value is incorrectly declared, what's the correct way of declaring it?
Thank you very much.
faint-sunset-36608
05/08/2024, 8:29 PM
longhorn-default-settings ConfigMap, and occasionally a controller tries to "copy" it to the applied settings (which would appear in the UI), but we log that instead. This behavior is changed in https://github.com/longhorn/longhorn/issues/7173 such that the setting will be copied to the applied settings and then applied lazily over time.
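A minimal sketch for checking whether the value in the ConfigMap has reached the applied setting yet. The helper names are hypothetical. The usage comments assume that the ConfigMap stores its data under a default-setting.yaml key and that the applied setting is a settings.longhorn.io object named taint-toleration with a top-level value field; verify both against your cluster before relying on them.

```shell
# Hypothetical helper: extracts the taint-toleration value from the
# default-setting YAML read on stdin (first match only, quotes kept as-is).
desired_toleration() {
  sed -n 's/^[[:space:]]*taint-toleration:[[:space:]]*//p' | head -n 1
}

# Hypothetical helper: compares the desired and applied values.
report_toleration_sync() {
  if [ "$1" = "$2" ]; then
    echo "in sync"
  else
    echo "pending: desired='$1' applied='$2'"
  fi
}

# Usage (needs cluster access; names below are assumptions, see lead-in):
#   desired=$(kubectl -n longhorn-system get configmap <configmap-name> \
#     -o jsonpath='{.data.default-setting\.yaml}' | desired_toleration)
#   applied=$(kubectl -n longhorn-system get settings.longhorn.io \
#     taint-toleration -o jsonpath='{.value}')
#   report_toleration_sync "$desired" "$applied"
```

A "pending" result is consistent with the behavior described above, where the copy to the applied settings is refused while volumes are attached.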
I think Longhorn has NOT validated the syntax of your setting yet, but it looks correct to me. The setting will only sync after ALL volumes detach, unfortunately.
late-needle-80860
05/08/2024, 8:46 PM
faint-sunset-36608
05/08/2024, 9:20 PM
late-needle-80860
05/08/2024, 9:49 PM