# general
c
What do you mean by "reprovisioning"? It should just be reconfiguring the nodes with the new version and restarting them, not as in building completely new nodes.
n
I do mean nodes are being created and deleted
c
How are you deploying your cluster?
n
When modifying the cluster config, there's a banner warning that modifications may cause reprovisioning. But in our case it happens more often than not.
that's where I took the "reprovisioning" nomenclature from
The clusters are being deployed through Rancher on bare EC2 instances.
c
If you are using the provisioning framework built in to Rancher it should just modify them in-place. If you’re using CAPI then yeah, CAPI does not support in-place upgrades and it will replace nodes to upgrade them.
Clusters -> Create -> RKE2/K3s selected -> Amazon EC2?
Also, what version of Rancher are you using?
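If it helps, you can also check from kubectl against the Rancher management ("local") cluster. Roughly speaking (assuming defaults, e.g. the fleet-default namespace that UI-created clusters normally land in), clusters created through the v2 provisioning framework should show up like this:
```
# run against the Rancher management ("local") cluster
# clusters created via the v2 provisioning framework (Clusters -> Create)
kubectl get clusters.provisioning.cattle.io -n fleet-default

# the machines backing those clusters (CAPI objects managed by Rancher)
kubectl get machines.cluster.x-k8s.io -n fleet-default
```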
n
I will double check for your first question. Thanks.
Rancher is 2.10.1
but we've been having this issue since way before that
c
yeah there is no reason it should do that. I suspect either you’re changing more than just the version, or you’re not using the correct provisioning framework
both would be somewhat hard to do by accident though
are you using the UI, or some automation like tf/ansible?
n
Just the UI. We've seen it happen (full node reprovisioning) merely when adding a new cluster owner
I'm fairly positive we're not doing anything else by accident, this has happened to us in many different situations in different environments
I wasn't the one who created the clusters, so I'm waiting for a response from my colleague, but this is what I could find. Is there any other way to find out how the cluster was created? One thing I'm positive about is that it was created through the UI.
c
And you are confident that it is actually deleting and recreating nodes and not just reconfiguring them in place?
it is normal to see it show the nodes as reprovisioning even if it is just updating their configuration in place.
n
Yes, a new node is created, and as soon as it becomes available/healthy, one of the old nodes is deleted
This almost wiped out our Longhorn drives
c
yeah that's not great. I'm an rke2/k3s dev, not Rancher, so I don't know why it might do that, but I've poked some folks internally to ask.
n
Thanks, much appreciated! Please keep me in the loop about this if possible. In the meantime, do you think it's worthwhile to create an issue, or should I wait?
@creamy-pencil-82913 something we also noticed is that sometimes when nodes are created in this way, RKE2-related processes completely monopolize open files and prevent the node from working properly (can't deploy, logs failing to operate, etc.). I included a report to showcase this. You can see that `/var/lib/rancher/rke2/data/v1.29.12-rke2r1-ee2e42023a73/bin/containerd-shim-runc-v2` is opened like a billion times (figuratively speaking). I don't know if these issues are related, but I thought I'd share in case they are.
c
yes, that's normal. there is a runc shim for each container; that is how containerd works.
For example, the `io.containerd.runc.v2` shim automatically groups based on the presence of labels. In practice, this means that containers launched by Kubernetes, that are part of the same Kubernetes pod, are handled by a single shim, grouping on the `io.kubernetes.cri.sandbox-id` label set by the CRI plugin.
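If you want to sanity-check that on a node, something like this should roughly line the shim count up with the pod count (paths assume a default RKE2 install):
```
# count running containerd-shim-runc-v2 processes (roughly one per pod)
pgrep -fc containerd-shim-runc-v2

# compare against the number of pod sandboxes containerd knows about
sudo /var/lib/rancher/rke2/bin/crictl \
  --runtime-endpoint unix:///run/k3s/containerd/containerd.sock \
  pods --quiet | wc -l
```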
n
Yesterday (and in the past too) we've had a NeuVector enforcer fail to deploy on a node, and its logs show "open inotify fail - error=too many open files"
Is this expected then, or is there something fishy?
c
sounds like you should increase the inotify limits then
some distros ship with defaults that are not tuned for Kubernetes 🤷
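If you want to see where a node currently stands before changing anything, a quick check (needs root to see every process's fds) would be something like:
```
# current inotify limits on the node
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches

# rough count of inotify instances currently open across all processes
sudo find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l
```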
n
I see. We are using Ubuntu, so I guess this is possible.
It does make me wonder though, couldn't that be something that Rancher/RKE2 configures if it detects a value that is too low?
c
in particular, if you are layering on additional things that use inotify (like security agents), you are more likely to need to increase it. it all depends on usage.
there is not a single value that is perfect for every workload or every node size.
n
Maybe a warning then? We're running fairly basic workloads without any crazy config, so it seems like something a lot of people might encounter. We basically have Grafana stuff, NeuVector, and a couple of user workloads.
(to be clear I'm not trying to criticize anything, I appreciate your help. Just curious)
In any case, thanks. Glad there's a simple-ish fix to this if it happens again in the future.
c
You might start with this and see if you need to increase it further
```
fs.inotify.max_user_instances=8192
fs.inotify.max_user_watches=524288
```
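To make it persist across reboots, the usual approach is a sysctl.d drop-in (the file name below is just an example):
```
# write the settings to a drop-in and reload (file name is an example)
cat <<'EOF' | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_instances=8192
fs.inotify.max_user_watches=524288
EOF
sudo sysctl --system
```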
n
Any news on this @creamy-pencil-82913?