# rke2
a
The scenario is not normal. Even if you have replicas, it's still a single point of failure and defeats the purpose of having more control planes. Besides, anything can go wrong if there's a total blackout: all 3 can fail and be corrupted. You should check the system logs for etcd more thoroughly
c
But the thing is: is it normal that it isn't able to boot because kube-proxy isn't starting?
And not because of corruption?
a
What was your goal?
c
My goal is to shut it down and boot it back up. Even if I lose some data I don't care, but I want it to reconnect
a
bad luck this time 😅
it's more complicated when you have >1 node; it might be that all are corrupted, or two of them aren't functioning
and they cannot reach quorum
c
so instead of running 3 all-in-ones
you suggest running 1 control-plane/etcd node and 2 workers?
a
That's why you normally don't have all control planes on one host
c
This is one Proxmox host, for laboratory purposes
a
I know it's lab...
if you don't need HA scenarios it's better to stick to one control plane node
and laboratory or development installations normally don't need that
c
Ok perfect
yes yes
I don't doubt you're correct
So I'll go with 1 control plane and 3 workers
nice to know
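[Editor's note] A minimal sketch of that topology as RKE2 config files; the hostname and token placeholder are hypothetical, and the server needs no special options since rke2-server runs etcd and the control plane by default:

```yaml
# /etc/rancher/rke2/config.yaml on each of the 3 workers (rke2-agent)
server: https://cp-01.example.lab:9345   # hypothetical control-plane hostname
token: <contents of /var/lib/rancher/rke2/server/node-token on the server>
```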
Isn't it a little bit scary, though, that if a disaster happens in production they can't reboot without manual assistance?
a
you can kill your agents in any way
the control plane can still be problematic
c
Yeah, I will remake the cluster, since I only have rke2-servers
a
in production you need to put them on separate hosts, ideally also in different locations, but within the same network
c
totally, that's what we're doing
but
yeah it's a little bit scary regardless
a
Fragile systems
Sometimes
but normally you also want a UPS
and batteries on hard drives so they can finish writing
etc..
c
I believe they are all set up like that
a
my PC normally comes back up; I have a similar setup to the one I mentioned:
1 cp, 1 w
c
But since the datacenters aren't mine and we just provide the k8s stack, I worried a little bit
maybe too much x)
yes, I'll do 1 control plane and 1 worker
thank you very much for now
btw
a
did it come up?
c
on my 1 control plane node, can I run both the etcd role and the control plane?
I'm making the cluster from scratch..
a
yes you can
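[Editor's note] For context: rke2-server runs etcd, the control plane components, and a kubelet on the same node by default, so no extra role configuration is needed. Once the node is up, you can verify; the paths below are the RKE2 defaults:

```shell
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes
# the single node should list roles like control-plane,etcd,master
```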
c
but yes, if I run it from scratch it works
i will try to reboot
with just 1 control plane
a
Sure
go ahead 🙂
c
this is the current state; I'll power off each one, then reboot the controller first
nope
it doesn't work
```
Apr 08 15:41:34 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:34Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2>
Apr 08 15:41:34 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:34Z" level=info msg="Cluster Role Bindings applied successfully"
Apr 08 15:41:37 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:37Z" level=info msg="Pod for etcd not synced (waiting for termination of old pod sandbox), retrying"
Apr 08 15:41:39 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:39Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2>
Apr 08 15:41:45 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:45Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2>
Apr 08 15:41:45 aio-rke2-01 rke2[734]: {"level":"warn","ts":"2024-04-08T15:41:45.667884Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying o>
Apr 08 15:41:45 aio-rke2-01 rke2[734]: {"level":"info","ts":"2024-04-08T15:41:45.667968Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints >
Apr 08 15:41:50 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:50Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Apr 08 15:41:50 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:50Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2>
Apr 08 15:41:55 aio-rke2-01 rke2[734]: time="2024-04-08T15:41:55Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2>
```
a
is etcd running on the node?
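[Editor's note] A quick way to answer that on an RKE2 server node, sketched with the default RKE2 paths (adjust if your install differs):

```shell
# list the etcd static-pod container via RKE2's bundled crictl
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps --name etcd

# and watch rke2-server while it retries
journalctl -u rke2-server -f
```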
c
ok no, nvm
it just took a while
but it worked
now the control plane and etcd are working
a
ok
c
just, idk, I believe the same behavior can also be replicated when there is >1 control plane..
the issue is kube-proxy not starting when there are 3 control planes
a
Maybe it just needs more than 1 node 🙂
how long did you wait?
c
I waited like 3-4 minutes
but before, when I asked in Slack,
I waited like 20 minutes for the rke2-server to start
and the issue was the same: kube-proxy not starting. If there were some recovery logic, I believe we could recover even with 3 nodes
a
the context deadline is 10 minutes
by default
c
yeah, and it doesn't start..
when there are 3 control planes; but it doesn't say "this or that" is corrupted, the kube-proxy logic just isn't handled properly and startup breaks at kube-proxy
a
Yeah, the problem is that etcd isn't up: `08T131002Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"`
c
Yeah, but the issue is closed..
In reality there still is an issue
No no, the only thing that isn't up in the 3-controller scenario is kube-proxy
a
because they couldn't reproduce 🙂
typically
c
hmm
I believe it is pretty straightforward to reproduce:
have 3 all-in-ones, power off all 3,
then boot up 1 VM
a
Then open an issue if it's reproducible
but keep in mind that it can behave differently in other environments
i
Perhaps another detail to keep in mind: if you have the control plane on 3 nodes, plus n workers, and the control planes on different hardware (perhaps virtual, but not on the same host), you can still run into trouble if you have a power outage. In that case (even if unlikely), you need a plan to get started again. Mine would be to restore etcd from a backup, and have backups made every 5 minutes or so.
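[Editor's note] The 5-minute backup idea maps onto RKE2's built-in etcd snapshotting; a hedged sketch of the relevant server options (values are examples, not recommendations):

```yaml
# /etc/rancher/rke2/config.yaml on server nodes
etcd-snapshot-schedule-cron: "*/5 * * * *"   # take a snapshot every 5 minutes
etcd-snapshot-retention: 20                  # keep the last 20 snapshots
```

Restoring after a total outage then uses the cluster-reset flow (the snapshot filename below is a placeholder):

```shell
systemctl stop rke2-server
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-file>
```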