# longhorn-storage
l
I have a cluster on v1.5.5 of Longhorn. I did not set the `taintToleration: "...."` key in the Longhorn `defaultSettings` before upgrading to v1.5+. So now, when introducing new worker nodes into this cluster, the instance-manager does NOT come up, and `kubectl describe` on an instance-manager says `TaintToleration failed`. Reading the docs, I understand that changing the tolerations can lead to crashing volumes. Will that also be the case when proper tolerations are already on the longhorn-manager, driver, and UI? So basically I'm hoping to be able to only set the `taintToleration: "...."` key in the Longhorn `defaultSettings`. Is this possible? Thank you.
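For concreteness, what I'm hoping for is roughly this in my Helm values (the taint key/value below is just a made-up example, and I'm assuming the `key=value:Effect` string format, with multiple tolerations separated by `;`, from the settings reference):
```
defaultSettings:
  # hypothetical taint; multiple tolerations would be ";"-separated
  taintToleration: "storage=longhorn:NoExecute"
```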
I tried upgrading Longhorn from v1.5.3 to v1.5.4 and at the same time configuring the `taintToleration: "...."` key in Longhorn `defaultSettings`.
… On this cluster (running Kubernetes v1.29.3), not all instance-managers restarted. There are still a couple of instances left that are many days old. Also, I did NOT see any workloads failing because they lost their volumes. Ideas and comments are highly appreciated. Thank you.
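(For what it's worth, this is roughly how I've been spotting the old instance-manager pods; the label selector is an assumption on my part and may differ between Longhorn versions:)
```
# list instance-manager pods sorted by creation time (oldest first)
kubectl get pods -n longhorn-system \
  -l longhorn.io/component=instance-manager \
  --sort-by=.metadata.creationTimestamp
```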
Anyone?
p
I am unfamiliar with taints/tolerations & Longhorn, btw.
l
Thank you for responding. Longhorn v1.5.5 configured the key like that. But maybe I'm specifying the toleration value itself incorrectly; I'll check.
@salmon-doctor-9726 do you have any insights here … mostly in regard to my initial message? Thank you very much.
Hmm … if I have to detach all volumes to get Longhorn to tolerate some taints, is it all volumes at the same time, or can I “take it” instance manager by instance manager and thereby better control how much downtime/degradation a given workload has? Thank you
s
cc @faint-sunset-36608
🙏 1
f
Hello @late-needle-80860. Can you take a look at my comment on the linked GitHub issue to see if it can help you? https://github.com/longhorn/longhorn/issues/6313#issuecomment-2101102260 If it cannot, can you provide a copy of your Helm values file? It may also be helpful to provide the output of the following command, as it can help us understand how longhorn-manager is responding (if at all) to your settings.
```
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=-1 | grep -i toleration
```
For example, while getting my bearings in the GitHub issue, I found the following log, which helped me understand why my `taint-toleration` setting (as configured incorrectly) wasn't being used.
```
[longhorn-manager-fbc2n] time="2024-05-08T17:26:31Z" level=error msg="Failed to unmarshal customized default settings from yaml data taint-toleration: [map[effect:NoExecute key:test/test operator:Equal value:true]]\npriority-class: longhorn-critical, will give up using them" func=types.getDefaultSettingFromYAML file="setting.go:1502" error="yaml: did not find expected ',' or ']'"
```
> I tried upgrading Longhorn from v1.5.3 to v1.5.4 and at the same time configuring the `taintToleration: "...."` key in Longhorn `defaultSettings`. … On this cluster (running Kubernetes v1.29.3), not all instance-managers restarted. There are still a couple of instances left that are many days old. Also, I did NOT see any workloads failing because they lost their volumes.

This is normal upgrade behavior. Even if engines are live upgraded, we cannot move them to a new instance-manager while they are running. Your old instance-managers will exist until all engines running in them have been stopped (over the normal course of time). It probably indicates you did not specify the `taintToleration` / `taint-toleration` setting in a way that Longhorn recognizes. If you had, it WOULD have killed all your instance-manager pods. We DO NOT recommend applying this setting while volumes are running.
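If you do decide to apply it, a rough way to confirm that every volume really is detached first (just a sketch; the column path is assumed from the Longhorn Volume CRD) would be:
```
# each volume should report "detached" before the toleration setting can sync
kubectl get volumes.longhorn.io -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state
```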
l
hmm thank you @faint-sunset-36608. I configured this:
```
defaultSettings:
  concurrentAutomaticEngineUpgradePerNodeLimit: 3
  createDefaultDiskLabeledNodes: true
  orphanAutoDeletion: true
  priorityClass: system-node-critical
  replicaReplenishmentWaitInterval: 300
  taintToleration: "BeingBootstrapped=true:NoExecute"
```
That's in my Longhorn Helm values file. And yes, I don't see the setting in the Longhorn UI. However, I do see it in the Longhorn settings ConfigMap in the `longhorn-system` namespace. I also tried `taintToleration: "BeingBootstrapped:NoExecute"` as the value, with the same result. The other settings come through, so this is weird. Checking the logs of the longhorn-manager certainly makes me wiser 😄. I see this:
```
time="2024-05-07T18:22:28Z" level=warning msg="Invalid customized default setting taint-toleration with value BeingBootstrapped=true:NoExecute, will continue applying other customized settings" func="datastore.(*DataStore).filterCustomizedDefaultSettings" file="longhorn.go:109" error="failed to set the setting taint-toleration with invalid value BeingBootstrapped=true:NoExecute: current state prevents this: cannot modify toleration setting before all volumes are detached"
```
So, about that message. Does it mean that:
• the toleration value itself is incorrect?
• and, further, that it could not be applied because all volumes have to be detached first?
If the toleration value is declared incorrectly, what's the correct way of declaring it? Thank you very much.
f
You hit these blocks of code: https://github.com/longhorn/longhorn-manager/blob/88074ee747e60d8edcd98072a793c8614d7bff86/datastore/longhorn.go#L108-L111 and https://github.com/longhorn/longhorn-manager/blob/88074ee747e60d8edcd98072a793c8614d7bff86/datastore/longhorn.go#L321-L328. In your version of Longhorn, we refuse to actually apply this setting while there are attached volumes. So it exists in the `longhorn-default-settings` ConfigMap, and occasionally a controller tries to "copy" it to the applied settings (which would appear in the UI), but we log a warning instead. This behavior is changed in https://github.com/longhorn/longhorn/issues/7173 such that the setting will be copied to the applied settings and then applied lazily over time. I think Longhorn has NOT validated the syntax of your setting yet, but it looks correct to me. Unfortunately, the setting will only sync after ALL volumes detach.
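If you want to see where the value is currently sitting, something along these lines should show the difference between the Helm-managed ConfigMap and the applied setting (resource names are assumed from a typical Helm install, so adjust if yours differ):
```
# the ConfigMap holds what Helm wrote; the Setting CR is what Longhorn applied
kubectl get configmap -n longhorn-system | grep -i default-setting
kubectl get settings.longhorn.io taint-toleration -n longhorn-system -o jsonpath='{.value}{"\n"}'
```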
l
Okay, it all makes sense now! Thank you. So if I upgrade to v1.6+ of Longhorn, it can migrate lazily over time … e.g. when an instance manager is restarted for whatever reason, or when new workers are introduced into the cluster and old ones are decommissioned?
f
Yes, it should work as you describe.
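One way to sanity-check it afterwards (again just a sketch; the label selector is an assumption) is to look at the tolerations that new instance-manager pods actually pick up:
```
# print each instance-manager pod name alongside its tolerations
kubectl get pods -n longhorn-system -l longhorn.io/component=instance-manager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.tolerations}{"\n"}{end}'
```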
🙏 1
🦜 1
❤️ 1
l
Thank you.