# rke2
e
👋 Has anyone had issues with rke2 upgrades in production where restarting containerd affects services' ability to write logs to stdout? In my case I lose logs for the time containerd is restarting, and I also have some applications that fail to finish processing requests when writing logs to stdout doesn't work, which causes downtime. Has anyone had to deal with anything like this?
h
For prod environments, I always cordon/drain worker nodes before RKE2 or OS maintenance to avoid situations like this
e
Managed rke2 clusters get upgraded within a few minutes, and in large clusters with hundreds of nodes I don't have the option to cordon/drain all workers: 1. it is expensive, since I would have to provision extra workers to migrate workloads to, and 2. it would take forever to upgrade the cluster this way.
Maybe there is an option to upgrade containerd only on new workers instead of an in-place upgrade on running workers.
cc @creamy-pencil-82913 any idea if this is possible? 🙏
m
You can configure the upgrade Plan using node selectors and labels to do a few nodes at a time: https://docs.rke2.io/upgrades/automated_upgrade
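Something like this, for example (just a rough sketch based on those docs; the `rke2-upgrade-batch` label and the target version are placeholders, adjust for your setup):

```yaml
# system-upgrade-controller Plan that only targets workers carrying a specific label,
# a couple at a time, so the containerd restart never hits the whole fleet at once.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan-batch-1
  namespace: system-upgrade
spec:
  concurrency: 2                       # at most 2 nodes upgrading in parallel
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: rke2-upgrade-batch, operator: In, values: ["batch-1"]}         # placeholder label
      - {key: node-role.kubernetes.io/control-plane, operator: DoesNotExist}
  tolerations:
    - {operator: Exists}
  cordon: true                         # cordon each node before upgrading it
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.31.4+rke2r1              # placeholder target version
```

Then you just move the label across batches of workers as each batch finishes.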
Normally the upgrades don't cause an issue, but I did notice that with some CNIs the upgrade from containerd 1.7 to 2.0 (k8s 1.30.x to 1.31.x) caused network routing failures. Noticed it with the Antrea CNI mainly.
e
Yeah, I had to "decouple" the Cilium CNI from the RKE2 bundle and install it separately, to make sure I am not upgrading Cilium and RKE2 at the same time, since I had an incident caused by exactly that. So I decided to keep Cilium on its own release cycle. I would love to do the same for containerd, but since it is a process managed by rke2 I don't see a way to achieve this.
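For reference, the decoupling on the RKE2 side is roughly this (a sketch from memory, adjust for your setup; Cilium itself is then installed and upgraded from its upstream Helm chart on its own schedule):

```yaml
# /etc/rancher/rke2/config.yaml
# Don't install the bundled CNI, so rke2 upgrades never touch Cilium.
cni: none
```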
Also, I am using Rancher-managed RKE2 clusters.
I have found something: https://github.com/containerd/containerd/issues/11511. Looks like this is a general containerd issue.
Still wondering if there is any way to avoid the containerd restart while doing an rke2 upgrade? 🙏
c
no. I’ve not generally seen problems with this. Is there something unique about your environment that makes the restarts especially slow?
if you have an application that blocks on the stdout/stderr pipe getting full, you might consider reconfiguring it to write to log files instead?
note that the Linux pipe buffer size is usually 64k, so if your application writes more than that during the restart, the write will block
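quick way to see that behavior locally (rough sketch, assumes Linux and Python 3.10+):

```python
import fcntl
import os

r, w = os.pipe()
# Default pipe capacity on Linux is 16 pages = 65536 bytes.
print("pipe buffer size:", fcntl.fcntl(w, fcntl.F_GETPIPE_SZ))

# A container's stdout pipe is blocking, so a real app would simply hang here;
# make the write end non-blocking so this demo returns instead.
os.set_blocking(w, False)

written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)  # nobody is reading the other end
except BlockingIOError:
    pass
print("bytes accepted before the pipe filled:", written)
```

anything the app tries to write past that while the other end isn't being read will stall the writing thread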
e
containerd restarts pretty fast, however for some very busy services it is enough to block hundreds of requests, which trips circuit breakers 😕 and that has a sort of snowballing effect on upstream / downstream services
c
blocking requests on logging is not great application behavior :/ I would probably suggest refactoring the app logging pipeline.
e
yeah, that was exactly my feedback to the product engineers too: if they need consistent logging they should not rely on stdout. However, I think we have experienced availability issues even with some proxies (apache 👴); I will have to try to reproduce that, though, to make sure it was affected by the same issue.