# rke2
e
👋 Has anyone had issues with rke2 upgrades in production where restarting containerd affects services' ability to write logs to stdout? In my case I lose logs for the time containerd is restarting, and I also have some applications that fail to finish processing requests when writing logs to stdout doesn't work, which causes downtime. Has anyone had to deal with anything like this?
h
For prod environments, I always cordon/drain worker nodes before RKE2 or OS maintenance to avoid situations like this
e
Managed rke2 clusters get upgraded within a few minutes, and in large clusters with hundreds of nodes I don't have the option to cordon/drain all workers: 1. it is expensive, since I would have to provision extra workers to migrate workloads to, and 2. it would take forever to upgrade the cluster this way.
Maybe there is an option to upgrade containerd only on new workers instead of an in-place upgrade on running workers.
cc @creamy-pencil-82913 any idea if this is possible? 🙏
m
You can configure the upgrade Plan using node selectors and labels to do a few nodes at a time: https://docs.rke2.io/upgrades/automated_upgrade
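Something like this, for example (just a rough sketch based on those docs; the `rke2-upgrade-batch` label and the target version are placeholders, adjust for your setup):

```yaml
# system-upgrade-controller Plan that only targets workers carrying a specific label,
# a couple at a time, so the containerd restart never hits the whole fleet at once.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan-batch-1
  namespace: system-upgrade
spec:
  concurrency: 2                       # at most 2 nodes upgrading in parallel
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: rke2-upgrade-batch, operator: In, values: ["batch-1"]}         # placeholder label
      - {key: node-role.kubernetes.io/control-plane, operator: DoesNotExist}
  tolerations:
    - {operator: Exists}
  cordon: true                         # cordon each node before upgrading it
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.31.4+rke2r1              # placeholder target version
```

Then you just move the label across batches of workers as each batch finishes.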
Normally the upgrades don't cause an issue, but I did notice that with some CNIs the upgrade from containerd 1.7 to 2.0 (k8s 1.30.x to 1.31.x) caused network routing failures. Noticed it with the Antrea CNI mainly.
e
Yeah, I had to "decouple" the Cilium CNI from the RKE2 bundle and install it separately, to make sure I am not upgrading Cilium and RKE2 at the same time, since I had an incident caused by exactly that. So I decided to keep Cilium on its own release cycle. I would love to do the same for containerd, but since it is a process managed by rke2 I don't see a way to achieve this.
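For reference, the decoupling on the RKE2 side is roughly this (a sketch from memory, adjust for your setup; Cilium itself is then installed and upgraded from its upstream Helm chart on its own schedule):

```yaml
# /etc/rancher/rke2/config.yaml
# Don't install the bundled CNI, so rke2 upgrades never touch Cilium.
cni: none
```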
Also, I am using Rancher-managed RKE2 clusters.
I have found something: https://github.com/containerd/containerd/issues/11511. Looks like this is a general containerd issue.
Still wondering if there is any way to avoid the containerd restart while doing an rke2 upgrade? 🙏
c
no. I’ve not generally seen problems with this. Is there something unique about your environment that makes the restarts especially slow?
if you have an application that blocks on the stdout/stderr pipe getting full, you might consider reconfiguring it to write to log files instead?
note that the Linux pipe buffer size is usually 64k, so if your application writes more than that during the restart, the write will block
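quick way to see that behavior locally (rough sketch, assumes Linux and Python 3.10+):

```python
import fcntl
import os

r, w = os.pipe()
# Default pipe capacity on Linux is 16 pages = 65536 bytes.
print("pipe buffer size:", fcntl.fcntl(w, fcntl.F_GETPIPE_SZ))

# A container's stdout pipe is blocking, so a real app would simply hang here;
# make the write end non-blocking so this demo returns instead.
os.set_blocking(w, False)

written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)  # nobody is reading the other end
except BlockingIOError:
    pass
print("bytes accepted before the pipe filled:", written)
```

anything the app tries to write past that while the other end isn't being read will stall the writing thread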
e
containerd restarts pretty fast, however for some very busy services it is enough to block hundreds of requests, which trips circuit breakers 😕 and that has a sort of snowballing effect on upstream / downstream services
c
blocking requests on logging is not great application behavior :/ I would probably suggest refactoring the app logging pipeline.
e
yeah, that was exactly my feedback to the product engineers too: if they need consistent logging they should not rely on stdout. However, I think we have experienced availability issues even with some proxies (apache 👴); I will have to try to reproduce that, though, to make sure it was affected by the same issue.