# general
b
I'm using: • "local"/"ebi" rke2, latest, 3-node (all roles) cluster in HA mode, which hosts AWX and Rancher and nothing else • "phosphophyllite" rke2, 19 worker nodes, 7 control plane nodes in HA mode, which hosts various web apps and services and some PostgreSQL databases local's Rancher provisioned phosphophyllite and manages it I noticed when one of EBI's nodes had a CPU soft lockup earlier tonight, it caused Rancher to go 503 Service Unavailable, which is, y'know, understandable
but at the same time I went to a few sites hosted on Phos and noticed they were ALSO 503ing
they came back to life as soon as I kicked the EBI/rancher node that was having CPU soft lockups
wondering how I can prevent future outages as Rancher should never take down production
I did check that turning off the "local"/"ebi" cluster entirely does not cause or reproduce the issue - everything keeps running fine without Rancher around
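Side note for anyone reading along: kubeconfigs downloaded from the Rancher UI normally route through Rancher itself (unless the authorized cluster endpoint is enabled), so for poking at the downstream cluster while Rancher is down you want one pulled straight off an rke2 control plane node. Rough sketch only; the hostname is a placeholder and the path is just the rke2 default:

```sh
# Copy phos's own admin kubeconfig from one of its control plane nodes;
# rke2 writes it to /etc/rancher/rke2/rke2.yaml by default.
# "phos-cp-1.internal" is a made-up hostname.
scp root@phos-cp-1.internal:/etc/rancher/rke2/rke2.yaml ~/.kube/phos.yaml

# The file points at https://127.0.0.1:6443; aim it at the node instead.
sed -i 's/127\.0\.0\.1/phos-cp-1.internal/' ~/.kube/phos.yaml

# kubectl now talks to phos directly, with no Rancher in the path.
kubectl --kubeconfig ~/.kube/phos.yaml get nodes
```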
b
That’s strange. Nothing in the downstream cluster should depend on the Rancher cluster 🤔
What was the state of workload pods during the time apps were giving 503?
And I assume everything righted itself after you corrected the rancher mgmt cluster situation?
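Next time it happens, a quick snapshot along these lines (taken with phos's own kubeconfig rather than one that goes through Rancher) would narrow it down a lot. Namespace and service names below are placeholders:

```sh
# Anything not Running/Completed across the whole cluster
kubectl get nodes -o wide
kubectl get pods -A -o wide | grep -vE 'Running|Completed'

# Recent cluster events, newest last
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50

# State of the ingress controller pods; rke2's bundled ingress-nginx
# usually lives in kube-system, adjust if yours is elsewhere
kubectl -n kube-system get pods -o wide | grep ingress

# For one of the 503ing apps: does its Service still have endpoints?
# ("my-app" / "web" are placeholders)
kubectl -n my-app get endpoints web
```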
b
Everything corrected itself once I hard rebooted the misbehaving node in "local"/"ebi"
I have no idea what the state of the workload pods was during the 503, unfortunately; was kinda panicking and trying to fix it 😅
but basically I checked a bunch of independent pods in different namespaces and they were all 503ing
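For next time, I guess the check that matters is whether the pods were actually down or whether the ingress layer was answering 503 on their behalf. Something like this, with made-up namespace/service names:

```sh
# Are the pods running, and does the Service still have backends?
kubectl -n my-app get pods -o wide
kubectl -n my-app get endpoints web   # no addresses means the ingress has nothing to route to

# Bypass the ingress entirely and hit the Service over a local port-forward
kubectl -n my-app port-forward svc/web 8080:80 &
sleep 2   # give the port-forward a moment to come up
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/
```

If the app answers fine over the port-forward while the browser still gets 503, the problem sits in front of the pods rather than in them.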
b
That's very strange! If you are collecting logs, I'd be interested to know what was going on for them to 503. I don't have any real ideas except some kind of common dependency that might've affected both the Rancher node and your deployments - but the hard reboot fix tends to negate that idea.
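If you do want to chase the common-dependency angle, the pieces Rancher installs into a downstream cluster are easy to enumerate. Run this against phos directly; it's a starting point, not a diagnosis:

```sh
# Rancher's agents (and its webhook) on a managed cluster usually live in cattle-system
kubectl -n cattle-system get pods -o wide

# Admission webhooks are the classic invisible shared dependency:
# list what's registered and what each one's failurePolicy is
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations \
  -o custom-columns=NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy
```

Caveat: a broken admission webhook only blocks API writes, not traffic to pods that are already running, so it wouldn't explain the 503s on its own - but it's the most obvious Rancher-shaped shared dependency to rule out.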