Hello all, I've just updated one of our clusters ...
# rke2
k
Hello all, I've just updated one of our clusters (using Rancher) from 1.31.3+rke2r1 to 1.31.7+rke2r1. This update also brings an updated version of the rke2-coredns chart. Since the upgrade, our coredns pods crash and restart approximately every 5 minutes. Unfortunately, we've had no luck finding the root cause. Does anyone have an idea how to get a step closer to it? See the snippet of the error we are seeing in coredns, just before it gets restarted:
c
I have never seen this before. What kind of hardware is this running on?
This month’s releases will package coredns v1.12.1. You can try adding a HelmChartConfig to bump to this image on your current release, or wait for the releases to come out later this week and upgrade to those.
I don’t know that it’ll fix this, since I’ve not seen it before, but it’s worth trying.
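For reference, a HelmChartConfig override for the packaged chart could look roughly like the sketch below. The resource kind, API group, chart name, and namespace are the standard ones for RKE2's packaged charts. The `image.tag` key is assumed from the chart's values and should be checked against the rke2-coredns `values.yaml`, and the tag value is a placeholder, not a real image tag.

```yaml
# Sketch only: overrides the image tag used by the packaged rke2-coredns chart.
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    image:
      # Placeholder, not a real tag: substitute the published
      # hardened coredns build that corresponds to v1.12.1.
      tag: "<coredns-v1.12.1-image-tag>"
```

On a standalone RKE2 server this kind of manifest is typically dropped into `/var/lib/rancher/rke2/server/manifests/` or applied with `kubectl apply -f`. On a Rancher-provisioned cluster the same values may need to go through the cluster configuration instead.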
k
This is running in a vSphere environment on pretty decent, recent hardware. Meanwhile we did some more troubleshooting, and it looks like it was solved by first rolling the HelmChart back to the previous version and then updating it again to 1.12.0. After that, the crash didn't occur anymore.
The weird thing is that recycling all nodes (both the control-plane nodes and all worker nodes) didn't solve the issue.
c
I wouldn’t expect it to. SIGILL is an abnormal crash, like a hardware fault or memory corruption; it wouldn’t be caused by anything in the system configuration.