# rke2
c
for server roles, we have `etcd` and `control-plane`, we don’t use the term `master` or `worker`.
it looks like this control-plane node is just starting up and has been configured to join a server that is not available. Is that correct?
if you’re using Rancher, it should manage the initial node (the one that the others are configured to join) for you. However, if you interfere with that by intentionally disabling that node, you may see strange results.
In a production outage scenario, we would suggest removing that node and letting rancher select a new init node for you.
If this was a standalone cluster (not managed by rancher) we would suggest that you configure a HA registration endpoint: https://docs.rke2.io/install/ha#1-configure-the-fixed-registration-address but rancher-managed clusters do this differently, by picking a single node as the registration endpoint.
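(For illustration of that fixed-registration-address setup on a standalone cluster, a minimal sketch; the hostname `rke2.example.internal` and the token below are placeholders, not values from this thread.)

```
# Hypothetical sketch for a standalone (non-Rancher) rke2 cluster: every joining
# server/agent points at a fixed registration address (DNS round-robin, VIP, or
# load balancer) rather than at one specific server node.
cat <<'EOF' | sudo tee /etc/rancher/rke2/config.yaml
server: https://rke2.example.internal:9345   # fixed registration address (placeholder)
token: <cluster-join-token>                  # shared cluster join token (placeholder)
EOF
```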
p
So as I understand it, in rke2 terminology `server` is an etcd or control plane node and `agent` is a worker node. If I separate etcd and control plane for HA I'll have 6 `server` nodes. Or for production is it still fine to have server nodes with etcd + control plane? For this I can, for example, take haproxy, pick any static internal IP address, and do load balancing for these 6 nodes on port `9345`.
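(A sketch of the haproxy idea above: plain TCP balancing of the rke2 supervisor port 9345 across the server nodes. All IPs here are placeholders, not values from this conversation.)

```
# Hypothetical haproxy snippet: layer-4 balancing of the registration port 9345
# across three server nodes (extend the server list to however many you run).
cat <<'EOF' | sudo tee -a /etc/haproxy/haproxy.cfg
frontend rke2_supervisor
    bind *:9345
    mode tcp
    default_backend rke2_servers

backend rke2_servers
    mode tcp
    balance roundrobin
    server server-1 10.0.0.11:9345 check
    server server-2 10.0.0.12:9345 check
    server server-3 10.0.0.13:9345 check
EOF
```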
> if you're using Rancher, it should manage the initial node (the one that the others are configured to join) for you. However, if you interfere with that by intentionally disabling that node, you may see strange results.
> In a production outage scenario, we would suggest removing that node and letting rancher select a new init node for you.
@creamy-pencil-82913 is there any Rancher documentation for this case? Because I don't have a clear picture of how to do it.
@creamy-pencil-82913 Or do you mean that in the Rancher case I don't need a load balancer, I just remove a failed node and it will continue working?
c
correct
p
Ok. So with no load balancer it should be fine. We just have to configure alerts in Rancher to be sent when a node is down, so we can remove it and let Rancher select a new init server.
c
Well I’m confused by what exactly you’re testing here. There’s no reason that shutting down one of your etcd nodes should cause other nodes to restart. Why is rke2 on the control-plane node restarting at this particular moment?
p
I'm also confused. It's the current state of the cluster. Let me turn off the etcd 1 VM on Proxmox.
Btw, it started showing an Updating state, which is also strange.
But it still shows all nodes are running
c
what exactly did you do? just stopped one of the VMs?
What caused the rke2 service on the other node to restart?
p
Yes. Just stopped the VM to simulate it going down.
control plane 2 started failing
The other cp is also crashing with the same error, so no control plane servers are working now.
c
was it working before you shut the other node down? Why did the service get stopped?
The logs should show what caused it to get restarted.
p
See also these logs
c
that's all normal; pods remain running when the service restarts, and systemd complains about it
p
> The logs should show what caused it to get restarted.
Is there any command I can run to verify it?
I opened /etc/rancher/rke2/config.yaml.d/50-rancher.yaml and server points to 120.11
```
"server": "https://xx.xx.120.11:9345",
```
c
yes, that’s managed by rancher
but once servers have joined the cluster, they should not need that server to be accessible any longer. it is just for joining.
Look through the journald logs for restarts of the rke2-server service. See why it was stopped or exited around the time you shut down the etcd node.
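(A couple of journalctl/systemctl invocations that would show this, as a sketch; the time window below is a placeholder around the test.)

```
# Show rke2-server start/stop events around the window when the etcd node was shut down.
sudo journalctl -u rke2-server --since "2024-01-25 22:00" --until "2024-01-26 00:00" \
  | grep -iE "starting|stopped|exited|failed"

# Restart count and last state transition of the service, per systemd.
systemctl status rke2-server
sudo systemctl show rke2-server -p NRestarts,ExecMainStatus,ActiveEnterTimestamp
```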
p
Hmm. I changed the address from 120.11 to 120.19, another etcd node, and it worked.
Here's what logs it shows now. And it configures with different load-balanced IP addresses for etcd.
The other cp is also up. Do you want me to turn off the 120.19 etcd and see what logs it will show?
When the other etcd is down
c
sure? I think you’re kinda pushing the edge of rancher’s cluster management here though. You’re modifying the config out from under rancher; it will likely try to put it back for you.
p
The other cp is not crashing but is trying to connect to cp 2.
BTW, for production do you recommend using separate etcd and control-plane nodes, or deploying servers with etcd + control-plane?
c
etcd+cp is generally fine, unless you have IO load or something like that and can’t scale the disks up sufficiently to keep etcd happy
I don’t generally see it as necessary on most clusters
p
It will be a small cluster for now with up to 10 worker nodes. But they will be big, with 50 vCPU and 200 GB of RAM each.
I'll try setting up etcd+cp for the servers and will recreate the cluster. After this I will test HA mode one more time by stopping server VMs.
When I create a cluster with Custom, what should I choose -> Default RKE2 Embedded or External?
c
are you planning on deploying your own cloud controller, or using one of the other choices?
p
Sorry, I'm not familiar with cloud controllers. I want to set up the cluster in an on-prem datacenter.
c
if you don’t know, I would probably leave it at the default then
p
Ok. Thanks
This time it looks like it worked.
```
Jan 25 23:12:06 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:06Z" level=info msg="Stopped tunnel to 10.73.120.11:9345"
Jan 25 23:12:06 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:06Z" level=info msg="Proxy done" err="context canceled" url="wss://10.73.120.11:9345/v1-rke2/connect"
Jan 25 23:12:15 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:15Z" level=info msg="Creating helm-controller event broadcaster"
Jan 25 23:12:15 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:15Z" level=info msg="Creating reencrypt-controller event broadcaster"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting helm.cattle.io/v1, Kind=HelmChart controller"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting helm.cattle.io/v1, Kind=HelmChartConfig controller"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting batch/v1, Kind=Job controller"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
Jan 25 23:12:24 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:24Z" level=info msg="Starting managed etcd apiserver addresses controller"
Jan 25 23:12:24 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:24Z" level=info msg="Starting managed etcd member removal controller"
Jan 25 23:12:24 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:24Z" level=info msg="Starting managed etcd snapshot ConfigMap controller"
Jan 25 23:12:26 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:26Z" level=info msg="error in remotedialer server [400]: read tcp 10.73.120.12:9345->10.73.120.11:56142: i/o timeout"
```
In the previous cluster I remember I wanted to troubleshoot something and removed the etcd db on all nodes with no restoration. Looks like that was the cause.
If I have 3 server nodes with etcd + cp, how many node failures will it allow, 1 or 2?
After stopping 2 VMs of the 3, the last one was giving these errors:
```
Jan 25 23:18:59 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:18:59Z" level=info msg="Waiting for etcd server to become available"
Jan 25 23:18:59 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:18:59Z" level=info msg="Waiting for API server to become available"
Jan 25 23:19:01 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:01Z" level=info msg="Defragmenting etcd database"
Jan 25 23:19:01 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:01Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:06 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:06Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:11 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:11Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:13 my-cluster-master-2 rke2[17873]: {"level":"warn","ts":"2024-01-25T23:19:13.886557Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000759340/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Jan 25 23:19:13 my-cluster-master-2 rke2[17873]: {"level":"info","ts":"2024-01-25T23:19:13.88666Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Jan 25 23:19:16 my-cluster-master-2 rke2[17873]: {"level":"warn","ts":"2024-01-25T23:19:16.554983Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000759340/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unknown desc = context deadline exceeded"}
Jan 25 23:19:16 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:16Z" level=info msg="Failed to test data store connection: failed to report and disarm etcd alarms: etcd alarm list failed: rpc error: code = Unknown desc = context deadline exceeded"
Jan 25 23:19:16 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:16Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:21 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:21Z" level=info msg="Defragmenting etcd database"
Jan 25 23:19:21 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:21Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:26 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:26Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:28 my-cluster-master-2 rke2[17873]: {"level":"warn","ts":"2024-01-25T23:19:28.886947Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000759340/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Jan 25 23:19:28 my-cluster-master-2 rke2[17873]: {"level":"info","ts":"2024-01-25T23:19:28.887031Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Jan 25 23:19:29 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:29Z" level=info msg="Waiting for API server to become available"
Jan 25 23:19:29 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:29Z" level=info msg="Waiting for etcd server to become available"
Jan 25 23:19:31 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:31Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:36 my-cluster-master-2 rke2[17873]: {"level":"warn","ts":"2
```
c
> I wanted to troubleshoot something and removed etcd db on all nodes with no restoration

Yes that would cause problems for sure

> If I have 3 servers nodes with etcd + cp how node failures it will allow

https://etcd.io/docs/v3.5/faq/#what-is-failure-tolerance
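(Spelled out from that FAQ: an etcd cluster of n members needs a quorum of floor(n/2) + 1 votes, so its failure tolerance is)

$$
f(n) = n - \left(\left\lfloor \tfrac{n}{2} \right\rfloor + 1\right), \qquad f(3) = 1, \qquad f(5) = 2.
$$

So 3 etcd+cp servers tolerate the loss of exactly one node, which matches the behaviour seen above when two of the three VMs were stopped.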
p
Thank you so much. I will continue experimenting with the cluster.
After the server node VM was restarted, it is showing 0 size for the newly created etcd snapshot.
c
that is a rancher bug. if you look on the nodes, they actually have a non-zero size. they were taken before the rancher agent was deployed, probably because you configured snapshots to run on an interval, and the snapshots are lacking metadata that rancher expects.
You can trigger additional manual snapshots, or just wait for the next automatic interval.
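(To double-check on a node that the snapshots really are non-zero, a minimal sketch, assuming the default rke2 snapshot directory has not been changed.)

```
# On a server node: list the local etcd snapshots rke2 has taken.
# Assumes the default snapshot location; adjust if snapshot-dir is overridden in config.yaml.
sudo ls -lh /var/lib/rancher/rke2/server/db/snapshots/
```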
p
Ok
c
you can just ignore them, they’ll get cleaned up eventually after you take more snapshots.
👍 1
p
It's still showing 0 size