# general
c
What do you mean by "reprovisioning"? It should just be reconfiguring the nodes with the new version and restarting them, not as in building completely new nodes.
n
I do mean nodes are being created and deleted
c
How are you deploying your cluster?
n
When modifying the cluster config, there's a banner warning that modifications may cause reprovisioning. But in our case it happens more often than not.
that's where I took the "reprovisioning" nomenclature from
The clusters are being deployed through Rancher on bare EC2 instances.
c
If you are using the provisioning framework built in to Rancher it should just modify them in-place. If you’re using CAPI then yeah, CAPI does not support in-place upgrades and it will replace nodes to upgrade them.
Clusters -> Create -> RKE2/K3s selected -> Amazon EC2?
Also, what version of Rancher are you using?
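If it helps, you can also check from kubectl against the Rancher management ("local") cluster. Roughly speaking (assuming defaults, e.g. the fleet-default namespace that UI-created clusters normally land in), clusters created through the v2 provisioning framework should show up like this:
```
# run against the Rancher management ("local") cluster
# clusters created via the v2 provisioning framework (Clusters -> Create)
kubectl get clusters.provisioning.cattle.io -n fleet-default

# the machines backing those clusters (CAPI objects managed by Rancher)
kubectl get machines.cluster.x-k8s.io -n fleet-default
```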
n
I will double check for your first question. Thanks.
Rancher is 2.10.1
but we've been having this issue since way before that
c
yeah there is no reason it should do that. I suspect either you’re changing more than just the version, or you’re not using the correct provisioning framework
both would be somewhat hard to do by accident though
are you using the UI, or some automation like tf/ansible?
n
Just the UI. We've seen it happen (full node reprovisioning) merely when adding a new cluster owner
I'm fairly positive we're not doing anything else by accident, this has happened to us in many different situations in different environments
I wasn't the one who created the clusters, so I'm waiting for a response from my colleague, but this is what I could find. Is there any other way to find out how the cluster was created? One thing I'm positive about is that it was created through the UI.
c
And you are confident that it is actually deleting and recreating nodes and not just reconfiguring them in place?
it is normal to see it show the nodes as reprovisioning even if it is just updating their configuration in place.
n
Yes, a new node is created, and as soon as it becomes available/healthy, one of the old nodes is deleted
This almost wiped out our Longhorn drives
c
yeah that's not great. I'm an rke2/k3s dev, not Rancher, so I don't know why it might do that, but I've poked some folks internally to ask.
n
Thanks, much appreciated! Please keep me in the loop about this if possible. In the meantime, do you think it's worthwhile to create an issue, or should I wait?
@creamy-pencil-82913 something we also noticed is that sometimes when nodes are created in this way, RKE2-related processes completely monopolize open files and prevent the node from working properly (can't deploy, logs failing to operate, etc.). I included a report to showcase this. You can see that `/var/lib/rancher/rke2/data/v1.29.12-rke2r1-ee2e42023a73/bin/containerd-shim-runc-v2` is opened like a billion times (figuratively speaking). I don't know if these issues are related, but I thought I'd share in case they are.
c
yes, that's normal. there is a runc shim for each container; that is how containerd works.
For example, the `io.containerd.runc.v2` shim automatically groups based on the presence of labels. In practice, this means that containers launched by Kubernetes, that are part of the same Kubernetes pod, are handled by a single shim, grouping on the `io.kubernetes.cri.sandbox-id` label set by the CRI plugin.
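If you want to sanity-check that on a node, something like this should roughly line the shim count up with the pod count (paths assume a default RKE2 install):
```
# count running containerd-shim-runc-v2 processes (roughly one per pod)
pgrep -fc containerd-shim-runc-v2

# compare against the number of pod sandboxes containerd knows about
sudo /var/lib/rancher/rke2/bin/crictl \
  --runtime-endpoint unix:///run/k3s/containerd/containerd.sock \
  pods --quiet | wc -l
```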
n
Yesterday (and in the past too) we've had a NeuVector enforcer fail to deploy on a node, and its logs show "open inotify fail - error=too many open files"
Is this expected then, or is there something fishy?
c
sounds like you should increase the inotify limits then
some distros ship with defaults that are not tuned for Kubernetes 🤷
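If you want to see where a node currently stands before changing anything, a quick check (needs root to see every process's fds) would be something like:
```
# current inotify limits on the node
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches

# rough count of inotify instances currently open across all processes
sudo find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l
```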
n
I see. We are using Ubuntu, so I guess this is possible.
It does make me wonder though, couldn't that be something that Rancher/RKE2 configures if it detects a value that is too low?
c
in particular, if you are layering on additional things that use inotify (like security agents), you are more likely to need to increase it. it all depends on usage.
there is not a single value that is perfect for every workload or every node size.
n
Maybe a warning then? We're running fairly basic workloads without any crazy config, so it seems like something a lot of people might encounter. We basically have Grafana stuff, NeuVector, and a couple of user workloads.
(to be clear I'm not trying to criticize anything, I appreciate your help. Just curious)
In any case, thanks. Glad there's a simple-ish fix to this if it happens again in the future.
c
You might start with this and see if you need to increase it further
```
fs.inotify.max_user_instances=8192
fs.inotify.max_user_watches=524288
```
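To make it persist across reboots, the usual approach is a sysctl.d drop-in (the file name below is just an example):
```
# write the settings to a drop-in and reload (file name is an example)
cat <<'EOF' | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_instances=8192
fs.inotify.max_user_watches=524288
EOF
sudo sysctl --system
```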
n
Any news on this @creamy-pencil-82913?