# rke2
c
we haven’t made any changes to the channels since last month’s releases. In the case of an outage it will just fail to resolve the latest version; it shouldn’t ever downgrade. Do you have SUC logs available? From the controller, not the upgrade jobs.
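(To make the "fail to resolve, never downgrade" point concrete, here's a hypothetical sketch — not the actual SUC source — of that behaviour: when the channel is unreachable, a plan keeps whatever version it already has.)

```python
# Hypothetical sketch of the failure mode described above: a channel outage
# means resolution fails, so the plan's current version is kept, and the SUC
# never falls back to an older release on its own.

def resolve_plan_version(channel_latest, current):
    """Return the version a plan should run.

    channel_latest is None when the channel server is unreachable.
    """
    if channel_latest is None:
        # Outage: resolution fails, so the existing version is kept.
        return current
    return channel_latest

print(resolve_plan_version(None, "v1.25.10+rke2r1"))               # outage: unchanged
print(resolve_plan_version("v1.25.10+rke2r1", "v1.25.9+rke2r1"))   # normal upgrade
```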
s
I do have the logs, but it basically only logs that it can’t contact the K8s API during the upgrade. It says nothing about what it’s actually doing. If you have a suggestion for a config change to make it do some more logging, I’d love to have it.
c
I believe even at the normal log level it should log when it polls the channel to resolve the version? I honestly can’t remember. I’ve never seen the SUC randomly downgrade nodes though; I really suspect something else is going on.
s
The only logs I have that aren’t from it failing to talk to K8s are:
```
time="2023-06-12T08:00:56Z" level=error msg="error syncing 'system-upgrade/apply-server-plan-on-mynode-with-60076d09e16f1f7be0af09e7-ab182': handler system-upgrade-controller: jobs.batch \"apply-server-plan-on-mynode-with-60076d09e16f1f7be0af09e7-ab182\" not found, requeuing"
```
Which suggests it was doing something, but not exactly what.
c
did the controller pod get restarted around that time? that’s normally what I see when it’s starting and the caches haven’t been synced yet.
also, what version of the SUC are you running?
s
Nope, it’s been running since November…
rancher/system-upgrade-controller:v0.9.1
c
that’s a bit old, you might try upgrading to 0.11.0 but even on 0.9 I still don’t have any idea what would cause it to do what you’re describing.
do you have logs from the rke2-server journald log to show the downgrade occurring?
s
It’s reassuring you’re as stumped as I am 😂
I have it saying the version when it starts up
```
journalctl -u rke2-server.service --since='2023-05-31 00:00:00' -g 'Starting rke2'
-- Journal begins at Sun 2023-01-22 00:00:03 GMT, ends at Wed 2023-06-14 21:48:02 BST. --
May 31 02:43:45 mynode.example.com rke2[2095329]: time="2023-05-31T02:43:45+01:00" level=info msg="Starting rke2 v1.25.10+rke2r1 (e0c376c606754f1ae6a1c2401f4f6e9146bda0f3)"
Jun 12 08:29:08 mynode.example.com rke2[443313]: time="2023-06-12T08:29:08+01:00" level=info msg="Starting rke2 v1.25.9+rke2r1 (842d05e64bcbf78552f1db0b32700b8faea403a0)"
Jun 12 08:44:02 mynode.example.com rke2[477981]: time="2023-06-12T08:44:02+01:00" level=info msg="Starting rke2 v1.25.10+rke2r1 (e0c376c606754f1ae6a1c2401f4f6e9146bda0f3)"
```
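(If you want to check other nodes for the same thing, the “Starting rke2” lines can be parsed and consecutive versions compared — a rough sketch, with the version format assumed from output like the above:)

```python
import re

# Scan journald "Starting rke2 vX.Y.Z+rke2rN" lines and flag any restart
# where the version went backwards compared to the previous start.

LINE_RE = re.compile(r'Starting rke2 v(\d+)\.(\d+)\.(\d+)\+rke2r(\d+)')

def versions(lines):
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            yield tuple(int(g) for g in m.groups())

def find_downgrades(lines):
    seen = list(versions(lines))
    # Tuple comparison orders (major, minor, patch, rke2 revision) correctly.
    return [(a, b) for a, b in zip(seen, seen[1:]) if b < a]

log = [
    'msg="Starting rke2 v1.25.10+rke2r1 (e0c376c)"',
    'msg="Starting rke2 v1.25.9+rke2r1 (842d05e)"',
    'msg="Starting rke2 v1.25.10+rke2r1 (e0c376c)"',
]
print(find_downgrades(log))  # one downgrade: 1.25.10 -> 1.25.9
```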
c
which channel are you pointed at? If it’s stable, we have it pinned here and it hasn’t changed in a bit: https://github.com/rancher/rke2/blob/master/channels.yaml#LL3C26-L3C26
s
```
channel: https://update.rke2.io/v1-release/channels/v1.25
```
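(That channel URL can also be resolved by hand: assuming the channel server answers with a redirect whose target is the pinned release, the version is the last path segment of the redirect URL — a small sketch of extracting it:)

```python
from urllib.parse import urlparse

# Assumption: the channel server redirects a channel URL to a release URL
# ending in the version tag, so the version is the last path segment.

def version_from_redirect(redirect_url):
    return urlparse(redirect_url).path.rstrip('/').rsplit('/', 1)[-1]

print(version_from_redirect(
    "https://github.com/rancher/rke2/releases/tag/v1.25.10+rke2r1"))
```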
c
only thing I can think of is a GH outage that caused the channel server not to see that release for a bit? It caches them though so the timing would be hard to pin down.
I don’t have access to the channel server logs, unfortunately
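(A toy illustration of why the caching makes the timing fuzzy — assuming the channel server caches the release list with a TTL, roughly like this, a GitHub blip only affects requests whose cache entry expired during the blip:)

```python
import time

# Minimal TTL cache sketch: a cached value keeps being served even while the
# upstream (GitHub) is down; only an expired entry forces a fresh fetch that
# can then fail or miss a release.

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key, fetch):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # cached value is served even if upstream is down
        value = fetch()    # may raise (or return stale data) during an outage
        self._store[key] = (now, value)
        return value

cache = TTLCache(ttl_seconds=300)
cache.get("v1.25", lambda: "v1.25.10+rke2r1")  # first request fills the cache
```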
s
Hmm, yes, I see how that could happen; nothing in GitHub’s incident history for that exact time window, but Pages and Actions did fall over later that day
In fact I’ve found another cluster that did the same thing, also v1.25