# rke2
l
Not sure if it's related, but it seems to match the time of a reboot of one of the master nodes.
I see `Failed to get recorded learner progress from etcd: context deadline exceeded` on the master that restarted 🤔
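In case it's useful, here's a rough way to pull more etcd-related lines out of the rke2-server journal on that master around the reboot (a sketch, untested; assumes systemd/journalctl and root access, and the 2-hour window is just a guess):

```python
"""Run on the restarted master: grep the rke2-server journal for etcd errors."""
import subprocess

out = subprocess.run(
    ["journalctl", "-u", "rke2-server", "--since", "2 hours ago", "--no-pager"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    # keep only etcd lines that look like errors or timeouts
    if "etcd" in line and ("error" in line.lower() or "deadline" in line.lower()):
        print(line)
```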
m
do you have two interfaces on your machines? https://github.com/rancher/rke2/issues/2606
l
Thanks for the link @miniature-notebook-6405. Nope, I'm not in that situation
From the logs, what I understand so far is that:
• one master node went down for some reason; it took a few seconds (maybe minutes, but not that long) for the service to come back up
• this caused the worker nodes to go into `NotReady`, but they didn't "recover" even several hours after the master node came back. And anyway, the other master nodes were perfectly fine.
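In case it helps, this is roughly how I'd check which workers are stuck and since when (a minimal sketch, assuming a working kubeconfig and the official `kubernetes` Python client):

```python
"""List each node's Ready condition and last kubelet heartbeat."""
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    ready = next(c for c in node.status.conditions if c.type == "Ready")
    print(f"{node.metadata.name}: Ready={ready.status} reason={ready.reason} "
          f"last_heartbeat={ready.last_heartbeat_time}")
```

If the heartbeats stopped around the reboot and never resumed, that would point more at the agents themselves than at the control plane.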
I wonder if this could be related to etcd quorum or something like that, as we have 3 master nodes 🤔
Still not clear why it wouldn't go back to normal by itself
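For what it's worth, etcd needs a Raft majority of members to stay writable, so with 3 masters losing a single one shouldn't break quorum by itself; a quick sanity check of the math (just arithmetic, nothing cluster-specific):

```python
def quorum(members: int) -> int:
    # Raft majority: more than half of the members must be healthy
    return members // 2 + 1

for n in (1, 3, 5):
    print(f"{n} etcd members -> quorum {quorum(n)}, tolerates {n - quorum(n)} down")
# 3 members -> quorum 2, so one master rebooting shouldn't take etcd down on its own
```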
m
one control node went down... have you ensured swap is off on all these machines?
l
swap is off on all machines
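A quick way to double-check that on each machine (trivial sketch; /proc/swaps only has its header line when no swap is active):

```python
# Anything after the "Filename Type Size Used Priority" header is an active swap area.
with open("/proc/swaps") as f:
    active = f.read().splitlines()[1:]

if active:
    print("swap is ON:")
    for line in active:
        print("  " + line)
else:
    print("swap is off")
```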
(For context, I'm more of a dev than an ops/sysadmin, just trying to help my teammates here, but I might sound silly in some of my questions; don't hesitate to state the obvious 🙂)
m
I'm dyed-in-the-wool ops but job title says dev
this is something that the much-hyped control loop architecture is clearly not healing
perhaps Nagios can detect and restart your agents, it can do things these days; that would be a legit control loop 🙂
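something in this spirit, very rough and untested (assumes the worker runs the usual rke2-agent systemd unit; NODE_NAME and the 10-minute grace period are my own placeholders):

```python
"""Watchdog sketch: restart the local rke2-agent if this node has been NotReady too long."""
import datetime
import os
import subprocess

from kubernetes import client, config

NODE_NAME = os.environ["NODE_NAME"]               # hypothetical: export it however you like
NOT_READY_GRACE = datetime.timedelta(minutes=10)  # assumption, tune to taste

config.load_kube_config()
v1 = client.CoreV1Api()

node = v1.read_node(NODE_NAME)
ready = next(c for c in node.status.conditions if c.type == "Ready")

if ready.status != "True":
    stuck_for = datetime.datetime.now(datetime.timezone.utc) - ready.last_transition_time
    if stuck_for > NOT_READY_GRACE:
        # last resort: kick the agent (a real setup should alert first)
        subprocess.run(["systemctl", "restart", "rke2-agent"], check=True)
```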
l
Yeah, the 1st thing we're gonna do is add the missing monitoring/alerting on this and set up some automation. But I'd like to understand the root cause.
Especially if it happens regularly
Thanks for the help anyway!
m
sure thanks for posting
I lost a control node once; there were massive numbers of dead pods generated by a CoreDNS autoscaler malfunction, so check for dead pod accumulation. Had to rebuild that node. Well, those were all-in-one rke1 nodes.
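if you want a quick look at that, something like this (same kubeconfig / `kubernetes` client assumptions as the earlier sketch) counts dead pods per namespace:

```python
"""Spot dead-pod accumulation: count Failed/Succeeded pods per namespace."""
from collections import Counter

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

dead = Counter()
for pod in v1.list_pod_for_all_namespaces().items:
    # Failed pods (including Evicted) and leftover Succeeded pods both count as "dead" here
    if pod.status.phase in ("Failed", "Succeeded"):
        dead[pod.metadata.namespace] += 1

for ns, count in dead.most_common():
    print(f"{ns}: {count} dead pods")
```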