# rke2
a
The scenario is not normal. Even if you have replicas, it's still a single point of failure and defeats the purpose of having more control planes. Besides, anything can go wrong if there's a total blackout: all 3 can fail and be corrupted. You should check the system logs for etcd more thoroughly
c
But the thing is: is it normal that it isn't able to boot because kube-proxy isn't starting?
And not because of corruption?
a
What was your goal?
c
My goal is to shut it down and boot it back up. Even if I lose some data I don't care, but I want it to reconnect
a
bad luck this time 😅
it's more complicated when you have >1 node; it might be that all are corrupted, or two of them aren't functioning
and they cannot reach quorum
c
so instead of running 3 all-in-ones
you suggest running 1 control-plane/etcd node and 2 workers?
a
That's why you normally don't have all control planes on one host
c
This is one Proxmox host, for laboratory purposes
a
I know it's lab...
if you don't need HA scenarios it's better to stick to one control plane node
and laboratory or development installations normally don't need that
c
Ok perfect
yes yes
I don't doubt you're correct
So I'll go with 1 control plane and 3 workers
nice to know
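[Editor's note] A minimal sketch of that topology as RKE2 config files; the hostname and token placeholder are hypothetical, and the server needs no special options since rke2-server runs etcd and the control plane by default:

```yaml
# /etc/rancher/rke2/config.yaml on each of the 3 workers (rke2-agent)
server: https://cp-01.example.lab:9345   # hypothetical control-plane hostname
token: <contents of /var/lib/rancher/rke2/server/node-token on the server>
```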
Isn't it a little bit scary, though, that if a disaster happens in production they can't reboot without manual assistance?
a
you can kill your agents in any way
the control plane can still be problematic
c
Yeah, I will remake the cluster, since I only have rke2-servers
a
in production you need to put them on separate hosts, ideally also in different locations, but within the same network
c
totally, that's what we're doing
but
yeah it's a little bit scary regardless
a
Fragile systems
Sometimes
but normally you also want a UPS
and batteries on hard drives so they can finish writing
etc..
c
I believe they are all set up like that
a
my PC normally comes back up; I have a similar setup to the one I mentioned:
1 cp, 1 w
c
But since the datacenters aren't mine and we just provide the k8s stack, I worried a little bit
maybe too much x)
yes, I'll do 1 control plane and 1 worker
thank you very much for now
btw
a
did it come up?
c
on my 1 control plane node, can I run both the etcd role and the control plane?
I'm making the cluster from scratch..
a
yes you can
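[Editor's note] For context: rke2-server runs etcd, the control plane components, and a kubelet on the same node by default, so no extra role configuration is needed. Once the node is up, you can verify; the paths below are the RKE2 defaults:

```shell
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes
# the single node should list roles like control-plane,etcd,master
```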
c
but yes, if I run it from scratch it works
i will try to reboot
with just 1 control plane
a
Sure
go ahead 🙂
c
this is the current state; I'll power off each one, then reboot the controller first
nope
it doesn't work
```
Apr 08 15:41:34 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:34Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2>
Apr 08 15:41:34 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:34Z" level=info msg="Cluster Role Bindings applied successfully"
Apr 08 15:41:37 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:37Z" level=info msg="Pod for etcd not synced (waiting for termination of old pod sandbox), retrying"
Apr 08 15:41:39 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:39Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2>
Apr 08 15:41:45 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:45Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2>
Apr 08 15:41:45 aio-rke2-01 rke2[734]: {"level":"warn","ts":"2024-04-08T15:41:45.667884Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying o>
Apr 08 15:41:45 aio-rke2-01 rke2[734]: {"level":"info","ts":"2024-04-08T15:41:45.667968Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints >
Apr 08 15:41:50 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:50Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Apr 08 15:41:50 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:50Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2>
Apr 08 15:41:55 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:55Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2>
```
a
is etcd running on the node?
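[Editor's note] A quick way to answer that on an RKE2 server node, sketched with the default RKE2 paths (adjust if your install differs):

```shell
# list the etcd static-pod container via RKE2's bundled crictl
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps --name etcd

# and watch rke2-server while it retries
journalctl -u rke2-server -f
```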
c
ok no, nvm
it just took a while
but it worked
now the control plane and etcd are working
a
ok
c
just, idk, I believe the same behavior can also be replicated when there is >1 control plane..
the issue is kube-proxy not starting when there are 3 control planes
a
Maybe it just needs more than 1 node 🙂
how long did you wait?
c
I waited like 3-4 minutes
but before, when I asked in Slack,
I waited like 20 minutes for the rke2-server to start
and the issue was the same: kube-proxy not starting. If there were some recovery logic, I believe we could recover even with 3 nodes
a
the context deadline is 10 minutes
by default
c
yeah, and it doesn't start..
when there are 3 control planes; but it doesn't say "this or that" is corrupted, the kube-proxy logic just isn't handled properly and startup breaks at kube-proxy
a
Yeah, the problem is that etcd isn't up: `08T131002Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"`
c
Yeah, but the issue is closed..
In reality there still is an issue
No no, the only thing that isn't up in the 3-controller scenario is kube-proxy
a
because they couldn't reproduce 🙂
typically
c
hmm
I believe it is pretty straightforward to reproduce:
have 3 all-in-ones, power off all 3,
then boot up 1 VM
a
Then open an issue if it's reproducible
but keep in mind that it can behave differently in other environments
i
Perhaps another detail to keep in mind: if you have the control plane on 3 nodes, plus n workers, and the control planes on different hardware (perhaps virtual, but not on the same host), you can still run into trouble if you have a power outage. In that case (even if unlikely), you need a plan to get started again. Mine would be to restore etcd from a backup, and have backups made every 5 minutes or so.
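[Editor's note] The 5-minute backup idea maps onto RKE2's built-in etcd snapshotting; a hedged sketch of the relevant server options (values are examples, not recommendations):

```yaml
# /etc/rancher/rke2/config.yaml on server nodes
etcd-snapshot-schedule-cron: "*/5 * * * *"   # take a snapshot every 5 minutes
etcd-snapshot-retention: 20                  # keep the last 20 snapshots
```

Restoring after a total outage then uses the cluster-reset flow (the snapshot filename below is a placeholder):

```shell
systemctl stop rke2-server
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-file>
```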