# rke2
r
Hello, I'm having trouble with my RKE2 cluster. It was running fine, but suddenly the entire control plane died (the node is still accessible), and I have no idea how to fix it. When I try to start rke2-server.service, it gets stuck in an endless loop of trying to defrag etcd (1.6 GB; I tried increasing the quota in config.yaml, but it doesn't seem to apply) and failing with context deadline exceeded. There were also some log entries about failed to revoke lease. I tried to use the command
ctr -a  /run/k3s/containerd/containerd.sock
but it also resulted in context deadline exceeded. Does anyone know what the issue might be, or how I could fix it?
h
How many etcd/control plane nodes do you have? What version of RKE2? Have you looked at this doc? https://gist.github.com/superseb/3b78f47989e0dbc1295486c186e944bf
r
Hi, I have 3 control plane nodes (embedded etcd) running v1.23.5. Yes, I have seen this doc and have made sure the binaries and images are there, but some of the commands don't work for me. I can't use
kubectl
to check some things because the master nodes are all dead.
I can't verify whether any pods are running on the master nodes through ctr/crictl, but given the context deadline exceeded errors, I can only assume etcd is failing somehow. It's just weird that this suddenly happened, and only to the master nodes.
h
then you will have to use
crictl
and
etcdctl
https://gist.github.com/superseb/3b78f47989e0dbc1295486c186e944bf#on-the-etcd-host-itself Did you see
alarm:NOSPACE
with etcd?
journalctl -u rke2-server
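A rough way to answer the alarm question from the service log is to count how many log lines contain each of the error strings mentioned in this thread. This is only a sketch: `scan_rke2_log` is a hypothetical helper name, and the pattern list is an assumption drawn from the messages quoted here, not an exhaustive list of etcd failure signatures.

```shell
# Count log lines that contain each error string mentioned in this thread.
# Reads log text on stdin and prints one "count<TAB>pattern" line per pattern.
# Matching is case-insensitive; a line is counted once even if it matches twice.
scan_rke2_log() (
    log=$(cat)
    for pattern in "NOSPACE" "context deadline exceeded" "failed to revoke lease" "defrag"; do
        # grep -c exits non-zero when nothing matches, but still prints 0.
        count=$(printf '%s\n' "$log" | grep -ci -- "$pattern" || true)
        printf '%s\t%s\n' "$count" "$pattern"
    done
)

# Example (on the affected node):
#   journalctl -u rke2-server --no-pager | scan_rke2_log
```

A zero next to `NOSPACE` would suggest the quota alarm is not what is blocking startup, although it does not rule out other etcd problems.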
r
I don't think so, I have also checked with
du -h
and I have more than enough space available on disk, hmm.
But when I ran
rke2 server --debug
, I saw it connecting to 127.0.0.1 on a port that is not 2379 or any other usual etcd port, followed by something like
"Failed to test data store connection: context deadline exceeded"
I tried to do a "cluster-reset" on one of the control plane nodes as well, but when I tried to restore from the latest snapshot, it failed midway (I didn't get a good look at why). Resetting without a restoration did work, though, so I'm not sure whether it's possibly a corrupted etcd.
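On the disk-space check earlier in the thread: `alarm:NOSPACE` is raised against etcd's backend quota (2 GiB by default), not against free space on the filesystem, so `du -h` alone does not settle it. The sketch below compares the size of the etcd db file with a quota. `etcd_db_usage_pct` is a hypothetical helper name, and the db path in the example comment assumes the default RKE2 data directory.

```shell
# Print the etcd db file size as a whole-number percentage of a backend quota.
# $1: path to the etcd db file
# $2: quota in bytes (optional; defaults to 2 GiB, etcd's default backend quota)
etcd_db_usage_pct() (
    db=$1
    quota=${2:-2147483648}
    size=$(wc -c < "$db") || exit 1
    echo $(( size * 100 / quota ))
)

# Example (path assumes the default RKE2 data directory):
#   etcd_db_usage_pct /var/lib/rancher/rke2/server/db/etcd/member/snap/db
```

If a larger quota was set in config.yaml, pass that value as the second argument to see how close the database is to the limit that was intended.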