# rke2
l
Not sure if it's related, but it seems to match the time of a reboot of one of the master nodes.
I see `Failed to get recorded learner progress from etcd: context deadline exceeded` on the master that restarted 🤔
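In case it's useful, here's a rough way to pull more etcd-related lines out of the rke2-server journal on that master around the reboot (a sketch, untested; assumes systemd/journalctl and root access, and the 2-hour window is just a guess):

```python
"""Run on the restarted master: grep the rke2-server journal for etcd errors."""
import subprocess

out = subprocess.run(
    ["journalctl", "-u", "rke2-server", "--since", "2 hours ago", "--no-pager"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    # keep only etcd lines that look like errors or timeouts
    if "etcd" in line and ("error" in line.lower() or "deadline" in line.lower()):
        print(line)
```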
m
do you have two interfaces on your machines? https://github.com/rancher/rke2/issues/2606
l
Thanks for the link @miniature-notebook-6405. Nope, I'm not in that situation
From the logs, what I understand so far is that:
• one master node went down for some reason; it took a few seconds (maybe minutes, but not that long) for the service to come back up
• this caused the worker nodes to go into `NotReady`, but they didn't "recover" even several hours after the master node came back. And anyway, the other master nodes were perfectly fine.
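In case it helps, this is roughly how I'd check which workers are stuck and since when (a minimal sketch, assuming a working kubeconfig and the official `kubernetes` Python client):

```python
"""List each node's Ready condition and last kubelet heartbeat."""
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    ready = next(c for c in node.status.conditions if c.type == "Ready")
    print(f"{node.metadata.name}: Ready={ready.status} reason={ready.reason} "
          f"last_heartbeat={ready.last_heartbeat_time}")
```

If the heartbeats stopped around the reboot and never resumed, that would point more at the agents themselves than at the control plane.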
I wonder if this could be related to etcd quorum or something like that, as we have 3 master nodes 🤔
Still not clear why it wouldn't go back to normal by itself
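For what it's worth, etcd needs a Raft majority of members to stay writable, so with 3 masters losing a single one shouldn't break quorum by itself; a quick sanity check of the math (just arithmetic, nothing cluster-specific):

```python
def quorum(members: int) -> int:
    # Raft majority: more than half of the members must be healthy
    return members // 2 + 1

for n in (1, 3, 5):
    print(f"{n} etcd members -> quorum {quorum(n)}, tolerates {n - quorum(n)} down")
# 3 members -> quorum 2, so one master rebooting shouldn't take etcd down on its own
```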
m
one control node went down... have you ensured swap is off on all these machines?
l
swap is off on all machines
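A quick way to double-check that on each machine (trivial sketch; /proc/swaps only has its header line when no swap is active):

```python
# Anything after the "Filename Type Size Used Priority" header is an active swap area.
with open("/proc/swaps") as f:
    active = f.read().splitlines()[1:]

if active:
    print("swap is ON:")
    for line in active:
        print("  " + line)
else:
    print("swap is off")
```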
(For context, I'm more of a dev than an ops/sysadmin, just trying to help my teammates here, but I might sound silly in some of my questions; don't hesitate to state the obvious 🙂)
m
I'm dyed-in-the-wool ops but job title says dev
this is something that the much-hyped control loop architecture is clearly not healing
perhaps Nagios can detect and restart your agents, it can do things these days; that would be a legit control loop 🙂
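something in this spirit, very rough and untested (assumes the worker runs the usual rke2-agent systemd unit; NODE_NAME and the 10-minute grace period are my own placeholders):

```python
"""Watchdog sketch: restart the local rke2-agent if this node has been NotReady too long."""
import datetime
import os
import subprocess

from kubernetes import client, config

NODE_NAME = os.environ["NODE_NAME"]               # hypothetical: export it however you like
NOT_READY_GRACE = datetime.timedelta(minutes=10)  # assumption, tune to taste

config.load_kube_config()
v1 = client.CoreV1Api()

node = v1.read_node(NODE_NAME)
ready = next(c for c in node.status.conditions if c.type == "Ready")

if ready.status != "True":
    stuck_for = datetime.datetime.now(datetime.timezone.utc) - ready.last_transition_time
    if stuck_for > NOT_READY_GRACE:
        # last resort: kick the agent (a real setup should alert first)
        subprocess.run(["systemctl", "restart", "rke2-agent"], check=True)
```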
l
Yeah, the 1st thing we're gonna do is add the missing monitoring/alerting on this and set up some automation. But I'd like to understand the root cause.
Especially if it happens regularly
Thanks for the help anyway!
m
sure thanks for posting
I lost a control node once; there were massive numbers of dead pods generated by a CoreDNS autoscaler malfunction, so check for dead pod accumulation. Had to rebuild that node. Well, those were all-in-one rke1 nodes.
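if you want a quick look at that, something like this (same kubeconfig / `kubernetes` client assumptions as the earlier sketch) counts dead pods per namespace:

```python
"""Spot dead-pod accumulation: count Failed/Succeeded pods per namespace."""
from collections import Counter

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

dead = Counter()
for pod in v1.list_pod_for_all_namespaces().items:
    # Failed pods (including Evicted) and leftover Succeeded pods both count as "dead" here
    if pod.status.phase in ("Failed", "Succeeded"):
        dead[pod.metadata.namespace] += 1

for ns, count in dead.most_common():
    print(f"{ns}: {count} dead pods")
```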