# rke2
c
for server roles, we have `etcd` and `control-plane`, we don’t use the term `master` or `worker`.
it looks like this control-plane node is just starting up and has been configured to join a server that is not available. Is that correct?
if you’re using Rancher, it should manage the initial node (the one that the others are configured to join) for you. However, if you interfere with that by intentionally disabling that node, you may see strange results.
In a production outage scenario, we would suggest removing that node and letting rancher select a new init node for you.
If this was a standalone cluster (not managed by rancher) we would suggest that you configure a HA registration endpoint: https://docs.rke2.io/install/ha#1-configure-the-fixed-registration-address but rancher-managed clusters do this differently, by picking a single node as the registration endpoint.
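(For illustration of that fixed-registration-address setup on a standalone cluster, a minimal sketch; the hostname `rke2.example.internal` and the token below are placeholders, not values from this thread.)

```
# Hypothetical sketch for a standalone (non-Rancher) rke2 cluster: every joining
# server/agent points at a fixed registration address (DNS round-robin, VIP, or
# load balancer) rather than at one specific server node.
cat <<'EOF' | sudo tee /etc/rancher/rke2/config.yaml
server: https://rke2.example.internal:9345   # fixed registration address (placeholder)
token: <cluster-join-token>                  # shared cluster join token (placeholder)
EOF
```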
p
So as I understand it, in rke2 terminology `server` is an etcd or control plane node and `agent` is a worker node. If I separate etcd and control plane for HA I'll have 6 `server` nodes. Or for production is it still fine to have server nodes with etcd + control plane? For this I can, for example, take haproxy, pick any static internal IP address, and do load balancing for these 6 nodes on port `9345`.
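(A sketch of the haproxy idea above: plain TCP balancing of the rke2 supervisor port 9345 across the server nodes. All IPs here are placeholders, not values from this conversation.)

```
# Hypothetical haproxy snippet: layer-4 balancing of the registration port 9345
# across three server nodes (extend the server list to however many you run).
cat <<'EOF' | sudo tee -a /etc/haproxy/haproxy.cfg
frontend rke2_supervisor
    bind *:9345
    mode tcp
    default_backend rke2_servers

backend rke2_servers
    mode tcp
    balance roundrobin
    server server-1 10.0.0.11:9345 check
    server server-2 10.0.0.12:9345 check
    server server-3 10.0.0.13:9345 check
EOF
```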
> if you're using Rancher, it should manage the initial node (the one that the others are configured to join) for you. However, if you interfere with that by intentionally disabling that node, you may see strange results.
> In a production outage scenario, we would suggest removing that node and letting rancher select a new init node for you.
@creamy-pencil-82913 is there any Rancher documentation for this case? Because I don't have a clear picture of how to do it.
@creamy-pencil-82913 Or do you mean that in the Rancher case I don't need a load balancer, I just remove a failed node and it will continue working?
c
correct
p
Ok. So with no load balancer it should be fine. We just have to configure alerts in Rancher to be sent when a node is down, so we can remove it and let Rancher select a new init server.
c
Well I’m confused by what exactly you’re testing here. There’s no reason that shutting down one of your etcd nodes should cause other nodes to restart. Why is rke2 on the control-plane node restarting at this particular moment?
p
I'm also confused. It's the current state of the cluster. Let me turn off the etcd 1 VM on Proxmox.
Btw, it started showing an Updating state, which is also strange.
But it still shows all nodes are running
c
what exactly did you do? just stopped one of the VMs?
What caused the rke2 service on the other node to restart?
p
Yes. Just stopped the VM to simulate it going down.
control plane 2 started failing
The other cp is also crashing with the same error, so no control plane servers are working now.
c
was it working before you shut the other node down? Why did the service get stopped?
The logs should show what caused it to get restarted.
p
See also these logs
c
that's all normal; pods remain running when the service restarts, and systemd complains about it
p
> The logs should show what caused it to get restarted.
Is there any command I can run to verify it?
I opened /etc/rancher/rke2/config.yaml.d/50-rancher.yaml and server points to 120.11
```
"server": "https://xx.xx.120.11:9345",
```
c
yes, that’s managed by rancher
but once servers have joined the cluster, they should not need that server to be accessible any longer. it is just for joining.
Look through the journald logs for restarts of the rke2-server service. See why it was stopped or exited around the time you shut down the etcd node.
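(A couple of journalctl/systemctl invocations that would show this, as a sketch; the time window below is a placeholder around the test.)

```
# Show rke2-server start/stop events around the window when the etcd node was shut down.
sudo journalctl -u rke2-server --since "2024-01-25 22:00" --until "2024-01-26 00:00" \
  | grep -iE "starting|stopped|exited|failed"

# Restart count and last state transition of the service, per systemd.
systemctl status rke2-server
sudo systemctl show rke2-server -p NRestarts,ExecMainStatus,ActiveEnterTimestamp
```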
p
Hmm. I changed the address from 120.11 to 120.19, another etcd node, and it worked.
Here's what logs it shows now. And it configures with different load-balanced IP addresses for etcd.
The other cp is also up. Do you want me to turn off the 120.19 etcd and see what logs it will show?
When the other etcd is down
c
sure? I think you’re kinda pushing the edge of rancher’s cluster management here though. You’re modifying the config out from under rancher; it will likely try to put it back for you.
p
The other cp is not crashing but is trying to connect to cp 2.
BTW, for production do you recommend using separate etcd and control-plane nodes, or deploying servers with etcd + control-plane?
c
etcd+cp is generally fine, unless you have IO load or something like that and can’t scale the disks up sufficiently to keep etcd happy
I don’t generally see it as necessary on most clusters
p
It will be a small cluster for now with up to 10 worker nodes. But they will be big, with 50 vCPU and 200 GB of RAM each.
I'll try setting up etcd+cp for the servers and will recreate the cluster. After this I will test HA mode one more time by stopping server VMs.
When I create a cluster with Custom, what should I choose -> Default RKE2 Embedded or External?
c
are you planning on deploying your own cloud controller, or using one of the other choices?
p
Sorry, I'm not familiar with cloud controllers. I want to set up the cluster in an on-prem datacenter.
c
if you don’t know, I would probably leave it at the default then
p
Ok. Thanks
This time it looks like it worked.
```
Jan 25 23:12:06 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:06Z" level=info msg="Stopped tunnel to 10.73.120.11:9345"
Jan 25 23:12:06 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:06Z" level=info msg="Proxy done" err="context canceled" url="wss://10.73.120.11:9345/v1-rke2/connect"
Jan 25 23:12:15 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:15Z" level=info msg="Creating helm-controller event broadcaster"
Jan 25 23:12:15 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:15Z" level=info msg="Creating reencrypt-controller event broadcaster"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting helm.cattle.io/v1, Kind=HelmChart controller"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting helm.cattle.io/v1, Kind=HelmChartConfig controller"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting batch/v1, Kind=Job controller"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
Jan 25 23:12:16 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:16Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
Jan 25 23:12:24 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:24Z" level=info msg="Starting managed etcd apiserver addresses controller"
Jan 25 23:12:24 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:24Z" level=info msg="Starting managed etcd member removal controller"
Jan 25 23:12:24 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:24Z" level=info msg="Starting managed etcd snapshot ConfigMap controller"
Jan 25 23:12:26 my-cluster-master-2 rke2[1218]: time="2024-01-25T23:12:26Z" level=info msg="error in remotedialer server [400]: read tcp 10.73.120.12:9345->10.73.120.11:56142: i/o timeout"
```
In the previous cluster I remember I wanted to troubleshoot something and removed the etcd db on all nodes with no restoration. Looks like that was the cause.
If I have 3 server nodes with etcd + cp, how many node failures will it allow, 1 or 2?
After stopping 2 VMs of the 3, the last one was giving these errors:
```
Jan 25 23:18:59 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:18:59Z" level=info msg="Waiting for etcd server to become available"
Jan 25 23:18:59 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:18:59Z" level=info msg="Waiting for API server to become available"
Jan 25 23:19:01 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:01Z" level=info msg="Defragmenting etcd database"
Jan 25 23:19:01 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:01Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:06 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:06Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:11 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:11Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:13 my-cluster-master-2 rke2[17873]: {"level":"warn","ts":"2024-01-25T23:19:13.886557Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000759340/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Jan 25 23:19:13 my-cluster-master-2 rke2[17873]: {"level":"info","ts":"2024-01-25T23:19:13.88666Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Jan 25 23:19:16 my-cluster-master-2 rke2[17873]: {"level":"warn","ts":"2024-01-25T23:19:16.554983Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000759340/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unknown desc = context deadline exceeded"}
Jan 25 23:19:16 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:16Z" level=info msg="Failed to test data store connection: failed to report and disarm etcd alarms: etcd alarm list failed: rpc error: code = Unknown desc = context deadline exceeded"
Jan 25 23:19:16 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:16Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:21 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:21Z" level=info msg="Defragmenting etcd database"
Jan 25 23:19:21 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:21Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:26 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:26Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:28 my-cluster-master-2 rke2[17873]: {"level":"warn","ts":"2024-01-25T23:19:28.886947Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000759340/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Jan 25 23:19:28 my-cluster-master-2 rke2[17873]: {"level":"info","ts":"2024-01-25T23:19:28.887031Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Jan 25 23:19:29 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:29Z" level=info msg="Waiting for API server to become available"
Jan 25 23:19:29 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:29Z" level=info msg="Waiting for etcd server to become available"
Jan 25 23:19:31 my-cluster-master-2 rke2[17873]: time="2024-01-25T23:19:31Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 25 23:19:36 my-cluster-master-2 rke2[17873]: {"level":"warn","ts":"2
```
c
> I wanted to troubleshoot something and removed etcd db on all nodes with no restoration

Yes that would cause problems for sure

> If I have 3 servers nodes with etcd + cp how node failures it will allow

https://etcd.io/docs/v3.5/faq/#what-is-failure-tolerance
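(Spelled out from that FAQ: an etcd cluster of n members needs a quorum of floor(n/2) + 1 votes, so its failure tolerance is)

$$
f(n) = n - \left(\left\lfloor \tfrac{n}{2} \right\rfloor + 1\right), \qquad f(3) = 1, \qquad f(5) = 2.
$$

So 3 etcd+cp servers tolerate the loss of exactly one node, which matches the behaviour seen above when two of the three VMs were stopped.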
p
Thank you so much. I will continue experimenting with the cluster.
After the server node VM was restarted, it is showing 0 size for the newly created etcd snapshot.
c
that is a rancher bug. if you look on the nodes, they actually have a non-zero size. they were taken before the rancher agent was deployed, probably because you configured snapshots to run on an interval, and the snapshots are lacking metadata that rancher expects.
You can trigger additional manual snapshots, or just wait for the next automatic interval.
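(To double-check on a node that the snapshots really are non-zero, a minimal sketch, assuming the default rke2 snapshot directory has not been changed.)

```
# On a server node: list the local etcd snapshots rke2 has taken.
# Assumes the default snapshot location; adjust if snapshot-dir is overridden in config.yaml.
sudo ls -lh /var/lib/rancher/rke2/server/db/snapshots/
```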
p
Ok
c
you can just ignore them, they’ll get cleaned up eventually after you take more snapshots.
👍 1
p
It's still showing 0 size