# general
b
I'm using: • "local"/"ebi" rke2, latest, 3-node (all roles) cluster in HA mode, which hosts AWX and Rancher and nothing else • "phosphophyllite" rke2, 19 worker nodes, 7 control plane nodes in HA mode, which hosts various web apps and services and some PostgreSQL databases local's Rancher provisioned phosphophyllite and manages it I noticed when one of EBI's nodes had a CPU soft lockup earlier tonight, it caused Rancher to go 503 Service Unavailable, which is, y'know, understandable
but at the same time I went to a few sites hosted on Phos and noticed they were ALSO 503ing
they came back to life as soon as I kicked the EBI/rancher node that was having CPU soft lockups
wondering how I can prevent future outages as Rancher should never take down production
I did check that turning off the "local"/"ebi" cluster entirely does not cause or reproduce the issue - everything keeps running fine without Rancher around
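Side note for anyone reading along: kubeconfigs downloaded from the Rancher UI normally route through Rancher itself (unless the authorized cluster endpoint is enabled), so for poking at the downstream cluster while Rancher is down you want one pulled straight off an rke2 control plane node. Rough sketch only; the hostname is a placeholder and the path is just the rke2 default:

```sh
# Copy phos's own admin kubeconfig from one of its control plane nodes;
# rke2 writes it to /etc/rancher/rke2/rke2.yaml by default.
# "phos-cp-1.internal" is a made-up hostname.
scp root@phos-cp-1.internal:/etc/rancher/rke2/rke2.yaml ~/.kube/phos.yaml

# The file points at https://127.0.0.1:6443; aim it at the node instead.
sed -i 's/127\.0\.0\.1/phos-cp-1.internal/' ~/.kube/phos.yaml

# kubectl now talks to phos directly, with no Rancher in the path.
kubectl --kubeconfig ~/.kube/phos.yaml get nodes
```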
b
That’s strange. Nothing in the downstream cluster should depend on the Rancher cluster 🤔
What was the state of workload pods during the time apps were giving 503?
And I assume everything righted itself after you corrected the rancher mgmt cluster situation?
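Next time it happens, a quick snapshot along these lines (taken with phos's own kubeconfig rather than one that goes through Rancher) would narrow it down a lot. Namespace and service names below are placeholders:

```sh
# Anything not Running/Completed across the whole cluster
kubectl get nodes -o wide
kubectl get pods -A -o wide | grep -vE 'Running|Completed'

# Recent cluster events, newest last
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50

# State of the ingress controller pods; rke2's bundled ingress-nginx
# usually lives in kube-system, adjust if yours is elsewhere
kubectl -n kube-system get pods -o wide | grep ingress

# For one of the 503ing apps: does its Service still have endpoints?
# ("my-app" / "web" are placeholders)
kubectl -n my-app get endpoints web
```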
b
Everything corrected itself once I hard rebooted the misbehaving node in "local"/"ebi"
I have no idea what the state of the workload pods was during the 503, unfortunately; was kinda panicking and trying to fix it 😅
but basically I checked a bunch of independent pods in different namespaces and they were all 503ing
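For next time, I guess the check that matters is whether the pods were actually down or whether the ingress layer was answering 503 on their behalf. Something like this, with made-up namespace/service names:

```sh
# Are the pods running, and does the Service still have backends?
kubectl -n my-app get pods -o wide
kubectl -n my-app get endpoints web   # no addresses means the ingress has nothing to route to

# Bypass the ingress entirely and hit the Service over a local port-forward
kubectl -n my-app port-forward svc/web 8080:80 &
sleep 2   # give the port-forward a moment to come up
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/
```

If the app answers fine over the port-forward while the browser still gets 503, the problem sits in front of the pods rather than in them.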
b
That's very strange! If you are collecting logs, I'd be interested to know what was going on for them to 503. I don't have any real ideas except some kind of common dependency that might've affected both the Rancher node and your deployments - but the hard reboot fix tends to negate that idea.
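If you do want to chase the common-dependency angle, the pieces Rancher installs into a downstream cluster are easy to enumerate. Run this against phos directly; it's a starting point, not a diagnosis:

```sh
# Rancher's agents (and its webhook) on a managed cluster usually live in cattle-system
kubectl -n cattle-system get pods -o wide

# Admission webhooks are the classic invisible shared dependency:
# list what's registered and what each one's failurePolicy is
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations \
  -o custom-columns=NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy
```

Caveat: a broken admission webhook only blocks API writes, not traffic to pods that are already running, so it wouldn't explain the 503s on its own - but it's the most obvious Rancher-shaped shared dependency to rule out.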