# rke2
c
just restarting them doesn’t work? Unless the datastore got corrupted there’s no reason you should need to restore…
a
No, I restarted them one by one and then they showed up as Unavailable in Rancher, both in the cluster view and the cluster management view, like you'd expect to see if they were down. But I could ssh to them once they were back up, and the etcd container was running. I was away on vacation last week and came back to the cluster broken Monday morning. The other dev on my team ran updates on all of our servers on Friday; since our test instances only have 4G of RAM, I'm pretty sure it hung them all at the same time. A couple of control planes were in the same state and needed a reboot. This is the most recent line in one of the etcd container logs:
{
  "level": "warn",
  "ts": "2024-08-22T01:14:26.354997Z",
  "caller": "etcdserver/server.go:2085",
  "msg": "failed to publish local member to cluster through raft",
  "local-member-id": "7ebc27a47a696333",
  "local-member-attributes": "{Name:ip-10-114-49-88.ec2.internal-e975046f ClientURLs:[<https://10.114.49.88:2379>]}",
  "request-path": "/0/members/7ebc27a47a696333/attributes",
  "publish-timeout": "15s",
  "error": "etcdserver: request timed out"
}
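For what it's worth, this is how I've been confirming the etcd container is actually running on those nodes. RKE2 bundles crictl, and these paths assume a default install:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps --name etcd
# the static pod logs also land under /var/log/pods/kube-system_etcd-*/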
It's just our test cluster, but I'd like to be able to recover from this in case something similar happens in prod
Weird, a couple more reboots later and one of them came back up and is happy in the cluster. The other two still show as down. When I ssh to them and run
kubectl get nodes
from either of those 2 I see this
kubectl get nodes
E0822 01:38:20.050581    7582 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
E0822 01:38:30.052144    7582 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
I'm trying to go through some steps here https://ranchermanager.docs.rancher.com/troubleshooting/kubernetes-components/troubleshooting-etcd-nodes and I'm getting
Error: context deadline exceeded
when trying to run the etcdctl commands in the etcd pods, which I know indicates the etcd instance is unhealthy. Makes sense, since only one etcd node appears to be behaving properly, which means the cluster doesn't have quorum.
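In case it helps anyone following along, the check I was running looks roughly like this (adapted from that Rancher doc; the cert paths assume a default RKE2 layout):
for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do
  kubectl -n kube-system exec $etcdpod -- sh -c "ETCDCTL_API=3 ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' etcdctl endpoint health"
done
That's what comes back with the context deadline exceeded error on the two bad nodes.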
c
Did their IPs change or something?
a
No. I can see in the etcd pod logs on the one good etcd server that it's trying to talk to the other two:
{
  "level": "warn",
  "ts": "2024-08-22T01:56:44.770927Z",
  "caller": "etcdserver/cluster_util.go:288",
  "msg": "failed to reach the peer URL",
  "address": "<https://10.114.49.88:2380/version>",
  "remote-member-id": "7ebc27a47a696333",
  "error": "Get \"<https://10.114.49.88:2380/version>\": dial tcp 10.114.49.88:2380: connect: connection refused"
}
And here's a snippet from the log of the etcd pod running on 10.114.49.88, the one the good etcd node can't connect to. I restarted this one about 10 minutes ago:
{"level":"info","ts":"2024-08-22T01:56:48.519303Z","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
{"level":"info","ts":"2024-08-22T01:56:48.520298Z","caller":"embed/serve.go:250","msg":"serving client traffic securely","traffic":"http","address":"127.0.0.1:2382"}
{"level":"info","ts":"2024-08-22T01:56:48.520358Z","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
{"level":"info","ts":"2024-08-22T01:56:48.521483Z","caller":"embed/serve.go:250","msg":"serving client traffic securely","traffic":"grpc","address":"127.0.0.1:2379"}
{"level":"info","ts":"2024-08-22T01:56:48.522852Z","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
{"level":"info","ts":"2024-08-22T01:56:48.523995Z","caller":"embed/serve.go:250","msg":"serving client traffic securely","traffic":"grpc","address":"10.114.49.88:2379"}
{"level":"info","ts":"2024-08-22T01:56:48.525162Z","caller":"etcdmain/main.go:44","msg":"notifying init daemon"}
{"level":"info","ts":"2024-08-22T01:56:48.525183Z","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}
honestly I don't really know what I'm looking at there but that portion seemed relevant
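One thing I do notice in that snippet is that it's serving client traffic on 2379/2382 but there's no line about a peer listener, which would line up with the connection refused on 2380 from the good node. On that node I'm going to check whether anything is actually bound to the peer port (ports assume the defaults):
ss -tlnp | grep -E '2379|2380'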
I found a bad certificate error in the logs for the good etcd node. This is complaining about one of the bad nodes.
{"level":"warn","ts":"2024-08-22T01:56:20.758795Z","caller":"etcdserver/cluster_util.go:155","msg":"failed to get version","remote-member-id":"7ebc27a47a696333","error":"Get \"<https://10.114.49.88:2380/version>\": dial tcp 10.114.49.88:2380: connect: connection refused"}
{"level":"warn","ts":"2024-08-22T01:56:20.880189Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55648","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.880905Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55658","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.892152Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55686","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.892637Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55672","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.932421Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55700","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.994648Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55716","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.994998Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55704","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:21.102371Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55732","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:21.103052Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55722","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:21.268047Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55740","server-name":"","error":"read tcp 10.114.49.102:2380->10.114.49.88:55740: read: connection reset by peer"}
{"level":"warn","ts":"2024-08-22T01:56:21.269205Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55738","server-name":"","error":"read tcp 10.114.49.102:2380->10.114.49.88:55738: read: connection reset by peer"}
c
Did something happen where they lost data? Is the time set correctly? Nothing else weird going on?
Worst case scenario you can just stop them all, start one of them with
rke2 server --cluster-reset
to set etcd membership back to a single node only, then delete the db from the other two nodes and rejoin them to the cluster.
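Roughly, assuming the default data dir:
# on all three etcd nodes
systemctl stop rke2-server
# on the one you want to keep (it resets membership and then tells you to restart normally)
rke2 server --cluster-reset
systemctl start rke2-server
# on the other two, wipe the old etcd data before rejoining
rm -rf /var/lib/rancher/rke2/server/db
systemctl start rke2-server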
a
I have no idea what happened. A bunch of yum updates were run late last week while I was out. I came in Monday morning and saw the cluster was down. I noticed I had 3 new control planes as of Sunday morning. I looked at the control plane ASG and saw it had been constantly replacing unhealthy nodes starting around the time the updates were run on Friday, and it kept doing that until Sunday morning, when those instances finally stabilized. I have the etcd instances in an ASG too, but I've suspended the ReplaceUnhealthy and Terminate processes because I definitely don't want those nodes terminated. I imagine whatever was causing the control plane instances to become unresponsive also happened to the etcd nodes, but I can't be certain.
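I suspended those with something like this (the ASG name is just a placeholder):
aws autoscaling suspend-processes --auto-scaling-group-name <etcd-asg-name> --scaling-processes ReplaceUnhealthy Terminate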
I don't even know how this one etcd node is healthy. It was in the same state as the other two, and I've rebooted it several times these past few days whenever I had a chance to look at it. Then after a reboot a couple of hours ago it was magically happy.
c
Ohh well if you have no members left from the original cluster, it's very possible none of them have any of the original etcd data left, in which case yeah you'd want to restore from a snapshot
a
@creamy-pencil-82913 I still had a good etcd node running, so I stopped rke2-server on all 3 etcd nodes (2 bad 1 good), and then on the good one I ran the
rke2 server --cluster-reset
which gave an error about the server flag. So I deleted the server line from
/etc/rancher/rke2/config.yaml.d/50-rancher.yaml
and continued with the reset, which seemed to work. I tried deleting the DB on another etcd node and starting the rke2-server process, and now my control planes say they're down. I suspect it's because that server setting is still in their rke2 configs. Do I need to delete it from the control plane and other etcd node configs too? That server is actually one of the bad ones I'm trying to bring back online; I guess it was the initial bootstrap server for my cluster?
Should I have completely removed the 2 bad etcd nodes from the cluster in the cluster management view too? It's not in the instructions here, so I didn't do that: https://docs.rke2.io/backup_restore
I brought the cluster down to just one etcd node and ran the cluster reset. I ran into problems trying to rejoin the other two etcd nodes, so I just deleted them and brought in 2 new etcd nodes. I also had to manually update the "server" value on my control plane nodes in
/etc/rancher/rke2/config.yaml.d/50-rancher.yaml
to point at the new etcd node; the value it had was an old etcd node. The docs don't mention that step; maybe it isn't necessary in some configurations, but it definitely was in mine.
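For reference, the value I had to change looks something like this (the IP is illustrative; 9345 is the default RKE2 supervisor port):
# /etc/rancher/rke2/config.yaml.d/50-rancher.yaml on the control plane nodes
server: https://10.114.49.102:9345   # must point at a server node that actually exists now
# (on the node I ran the cluster reset from, the server: line had to be removed entirely)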
I also had to redeploy a few things in the cluster once it was back up with all of the etcd, control plane, and worker nodes. I don't remember exactly what I redeployed, but I know it was at least the coredns and canal deployments. It was a good exercise; I'm glad I didn't just wipe it away and start from scratch, which was tempting since it's just a test cluster.
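By "redeploy" I mostly mean a rollout restart once the API was reachable again; the resource names below are just what they're called in my cluster and may differ in yours:
kubectl -n kube-system rollout restart deployment rke2-coredns-rke2-coredns
kubectl -n kube-system rollout restart daemonset rke2-canal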
c
If you're using Rancher, you should do the node delete and restore from snapshot through Rancher. Otherwise Rancher will try to reset the contents of 50-rancher.yaml, and manage the service state, out from under you.
a
Good to know, thanks. I didn't have to restore from snapshot this time, since I still had one good etcd node luckily.