# rke2
c
just restarting them doesn’t work? Unless the datastore got corrupted there’s no reason you should need to restore…
a
No, I restarted them one by one and then they showed up as Unavailable in Rancher, both in the cluster view and the cluster management view, like you'd expect to see if they were down. But I could ssh to them once they were back up, and the etcd container was running. I was away on vacation last week and came back to the cluster broken Monday morning. The other dev on my team ran updates on all of our servers on Friday; since our test instances only have 4G of RAM, I'm pretty sure it hung them all at the same time. A couple of control planes were in the same state and needed a reboot. This is the most recent line in one of the etcd container logs:
{
  "level": "warn",
  "ts": "2024-08-22T01:14:26.354997Z",
  "caller": "etcdserver/server.go:2085",
  "msg": "failed to publish local member to cluster through raft",
  "local-member-id": "7ebc27a47a696333",
  "local-member-attributes": "{Name:ip-10-114-49-88.ec2.internal-e975046f ClientURLs:[<https://10.114.49.88:2379>]}",
  "request-path": "/0/members/7ebc27a47a696333/attributes",
  "publish-timeout": "15s",
  "error": "etcdserver: request timed out"
}
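For what it's worth, this is how I've been confirming the etcd container is actually running on those nodes. RKE2 bundles crictl, and these paths assume a default install:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps --name etcd
# the static pod logs also land under /var/log/pods/kube-system_etcd-*/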
It's just our test cluster, but I'd like to be able to recover from this in case something similar happens in prod
Weird, a couple more reboots later and one of them came back up and is happy in the cluster. The other two still show as down. When I ssh to them and run
kubectl get nodes
from either of those 2 I see this
kubectl get nodes
E0822 01:38:20.050581    7582 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
E0822 01:38:30.052144    7582 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": net/http: TLS handshake timeout
I'm trying to go through some steps here https://ranchermanager.docs.rancher.com/troubleshooting/kubernetes-components/troubleshooting-etcd-nodes and I'm getting
Error: context deadline exceeded
when trying to run the etcdctl commands in the etcd pods, which I know indicates the etcd instance is unhealthy. Makes sense, since only one etcd node appears to be behaving properly, which means the cluster doesn't have quorum.
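In case it helps anyone following along, the check I was running looks roughly like this (adapted from that Rancher doc; the cert paths assume a default RKE2 layout):
for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do
  kubectl -n kube-system exec $etcdpod -- sh -c "ETCDCTL_API=3 ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' ETCDCTL_CACERT='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' ETCDCTL_CERT='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' ETCDCTL_KEY='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' etcdctl endpoint health"
done
That's what comes back with the context deadline exceeded error on the two bad nodes.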
c
Did their IPs change or something?
a
No. I can see in the etcd pod logs on the one good etcd server that it's trying to talk to the other two:
{
  "level": "warn",
  "ts": "2024-08-22T01:56:44.770927Z",
  "caller": "etcdserver/cluster_util.go:288",
  "msg": "failed to reach the peer URL",
  "address": "<https://10.114.49.88:2380/version>",
  "remote-member-id": "7ebc27a47a696333",
  "error": "Get \"<https://10.114.49.88:2380/version>\": dial tcp 10.114.49.88:2380: connect: connection refused"
}
And here's a snippet from the log of the etcd pod running on 10.114.49.88, the one the good etcd node can't connect to. I restarted this one about 10 minutes ago:
{"level":"info","ts":"2024-08-22T01:56:48.519303Z","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
{"level":"info","ts":"2024-08-22T01:56:48.520298Z","caller":"embed/serve.go:250","msg":"serving client traffic securely","traffic":"http","address":"127.0.0.1:2382"}
{"level":"info","ts":"2024-08-22T01:56:48.520358Z","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
{"level":"info","ts":"2024-08-22T01:56:48.521483Z","caller":"embed/serve.go:250","msg":"serving client traffic securely","traffic":"grpc","address":"127.0.0.1:2379"}
{"level":"info","ts":"2024-08-22T01:56:48.522852Z","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
{"level":"info","ts":"2024-08-22T01:56:48.523995Z","caller":"embed/serve.go:250","msg":"serving client traffic securely","traffic":"grpc","address":"10.114.49.88:2379"}
{"level":"info","ts":"2024-08-22T01:56:48.525162Z","caller":"etcdmain/main.go:44","msg":"notifying init daemon"}
{"level":"info","ts":"2024-08-22T01:56:48.525183Z","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}
honestly I don't really know what I'm looking at there but that portion seemed relevant
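One thing I do notice in that snippet is that it's serving client traffic on 2379/2382 but there's no line about a peer listener, which would line up with the connection refused on 2380 from the good node. On that node I'm going to check whether anything is actually bound to the peer port (ports assume the defaults):
ss -tlnp | grep -E '2379|2380'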
I found a bad certificate error in the logs for the good etcd node. This is complaining about one of the bad nodes.
{"level":"warn","ts":"2024-08-22T01:56:20.758795Z","caller":"etcdserver/cluster_util.go:155","msg":"failed to get version","remote-member-id":"7ebc27a47a696333","error":"Get \"<https://10.114.49.88:2380/version>\": dial tcp 10.114.49.88:2380: connect: connection refused"}
{"level":"warn","ts":"2024-08-22T01:56:20.880189Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55648","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.880905Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55658","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.892152Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55686","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.892637Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55672","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.932421Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55700","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.994648Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55716","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:20.994998Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55704","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:21.102371Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55732","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:21.103052Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55722","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-08-22T01:56:21.268047Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55740","server-name":"","error":"read tcp 10.114.49.102:2380->10.114.49.88:55740: read: connection reset by peer"}
{"level":"warn","ts":"2024-08-22T01:56:21.269205Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.114.49.88:55738","server-name":"","error":"read tcp 10.114.49.102:2380->10.114.49.88:55738: read: connection reset by peer"}
c
Did something happen where they lost data? Is the time set correctly? Nothing else weird going on?
Worst case scenario you can just stop them all, start one of them with
rke2 server --cluster-reset
to set etcd membership back to a single node only, then delete the db from the other two nodes and rejoin them to the cluster.
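Roughly, assuming the default data dir:
# on all three etcd nodes
systemctl stop rke2-server
# on the one you want to keep (it resets membership and then tells you to restart normally)
rke2 server --cluster-reset
systemctl start rke2-server
# on the other two, wipe the old etcd data before rejoining
rm -rf /var/lib/rancher/rke2/server/db
systemctl start rke2-server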
a
I have no idea what happened. A bunch of yum updates were run late last week while I was out. I came in Monday morning and saw the cluster was down. I noticed I had 3 new control planes as of Sunday morning. I looked at the control plane ASG and saw it had been constantly replacing unhealthy nodes starting around the time the updates were run on Friday, and it kept doing that until Sunday morning, when those instances finally stabilized. I have the etcd instances in an ASG too, but I've suspended the ReplaceUnhealthy and Terminate processes because I definitely don't want those nodes terminated. I imagine whatever was causing the control plane instances to become unresponsive also happened to the etcd nodes, but I can't be certain.
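I suspended those with something like this (the ASG name is just a placeholder):
aws autoscaling suspend-processes --auto-scaling-group-name <etcd-asg-name> --scaling-processes ReplaceUnhealthy Terminate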
I don't even know how this one etcd node is healthy. It was in the same state as the other two, and I've rebooted it several times these past few days whenever I had a chance to look at it. Then after a reboot a couple of hours ago it was magically happy.
c
Ohh well if you have no members left from the original cluster, it's very possible none of them have any of the original etcd data left, in which case yeah you'd want to restore from a snapshot
a
@creamy-pencil-82913 I still had a good etcd node running, so I stopped rke2-server on all 3 etcd nodes (2 bad 1 good), and then on the good one I ran the
rke2 server --cluster-reset
which gave an error about the server flag. So I deleted the server line from
/etc/rancher/rke2/config.yaml.d/50-rancher.yaml
and continued with the reset, which seemed to work. I tried deleting the DB on another etcd node and starting the rke2-server process, and now my control planes say they're down. I suspect it's because that server setting is still in their rke2 configs. Do I need to delete it from the control plane and other etcd node configs too? That server is actually one of the bad ones I'm trying to bring back online; I guess it was the initial bootstrap server for my cluster?
Should I have completely removed the 2 bad etcd nodes from the cluster in the cluster management view too? It's not in the instructions here, so I didn't do that: https://docs.rke2.io/backup_restore
I brought the cluster down to just one etcd node and ran the cluster reset. I ran into problems trying to rejoin the other two etcd nodes, so I just deleted them and brought in 2 new etcd nodes. I also had to manually update the "server" value on my control plane nodes in
/etc/rancher/rke2/config.yaml.d/50-rancher.yaml
to point at the new etcd node; the value it had was an old etcd node. The docs don't mention that step; maybe it isn't necessary in some configurations, but it definitely was in mine.
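For reference, the value I had to change looks something like this (the IP is illustrative; 9345 is the default RKE2 supervisor port):
# /etc/rancher/rke2/config.yaml.d/50-rancher.yaml on the control plane nodes
server: https://10.114.49.102:9345   # must point at a server node that actually exists now
# (on the node I ran the cluster reset from, the server: line had to be removed entirely)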
I also had to redeploy a few things in the cluster once it was back up with all of the etcd, control plane, and worker nodes. I don't remember exactly what I redeployed, but I know it was at least the coredns and canal deployments. It was a good exercise; I'm glad I didn't just wipe it away and start from scratch, which was tempting since it's just a test cluster.
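By "redeploy" I mostly mean a rollout restart once the API was reachable again; the resource names below are just what they're called in my cluster and may differ in yours:
kubectl -n kube-system rollout restart deployment rke2-coredns-rke2-coredns
kubectl -n kube-system rollout restart daemonset rke2-canal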
c
If you're using Rancher, you should do the node delete and restore from snapshot through Rancher. Otherwise Rancher will try to reset the contents of 50-rancher.yaml, and manage the service state, out from under you.
a
Good to know, thanks. I didn't have to restore from snapshot this time, since I still had one good etcd node luckily.