# general
r
this is after I deleted most nodes in order to roll out a new VM template
b
Looks like etcd lost quorum, so it's in a read-only state. You probably need to restore this cluster from backup. Always add first, then remove, one node at a time.
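(etcd needs a majority of members to accept writes, e.g. 2 of 3, so pulling two of three etcd nodes at once drops it below quorum and it locks itself read-only)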
r
oh, so i did break it 😄
can i fix it by removing all but one etcd node?
b
Not likely; etcd is in a protected read-only mode right now. Deleting a 'node' in kubernetes requires a write to etcd to update :/
r
unfortunate
b
Yea etcd can be a bit of a pain. When deleting etcd nodes, always go one at a time. Make sure the replacement is fully up before starting on the next one
r
I wanna try to connect to the etcd. where can i find the certs and all?
b
What k8s distro?
r
i installed it through rancher
b
Ok RKE1 rke2 or k3s?
They're all gonna be a bit different
r
uhhh, 1 i think?
shouldnt rancher know all this?
I found the ca cert in the cluster object in the api
but not the client cert+key
b
So etcd uses mTLS. Rancher talks to the API server within the cluster, which talks to etcd; rancher doesn't talk to etcd directly
r
ah
so basically it's in some Secret?
b
There is a path on the nodes that has both, one sec, gotta look it up. For RKE1 I believe it's in /var/lib/rancher but I could be mistaken
r
on the etcd node i guess?
b
Yep
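If it's RKE1 the quickest way in is probably to just exec into the etcd container on that node, since it already has the client certs and env wired up, something like:
docker exec etcd etcdctl member list   # RKE1's etcd container should just be named "etcd"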
r
btw, just restarting that last etcd node wouldnt help, right?
b
Not likely, but I mean, feel free to try anything
r
probably the opposite
there's only /var/lib/rancher/rke/log
b
Well you really can't hurt anything at this point, the cluster is in a pretty bad state. I'm assuming this, but to prove it out I would look at what the cluster thinks its state is with kubectl get nodes
Maybe under /etc/rancher/ssl then?
r
kubectl get nodes only lists xout-ops-control-etc2 and xout-ops-worker1
b
Sorry on phone now having trouble validating
r
no /etc/rancher 😕
b
If you can kubectl see if you can create or delete any pod on the cluster
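e.g. something throwaway like this, just to force a write through to etcd (name/image don't matter):
kubectl run etcd-write-test --image=busybox --restart=Never -- sleep 60   # creating it needs an etcd write
kubectl delete pod etcd-write-test                                        # so does deleting it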
r
there is a 12-minute-old pod
b
Can you create/delete anything on the cluster? I'm assuming etcd is dead based on one picture.
r
deleting works
b
If you can run the member list on the etcd node it should tell you the state of the etcd cluster
r
ah, one of the other nodes is still in
b
Deleting working is a good sign. So maybe what is really happening is there's an issue with the cattle cluster agent pod, and it might be worth investigating it
r
root@xout-ops-control-etc2:/# docker exec etcd etcdctl member list
7e3c56690e6f1ef9, started, etcd-xout-ops-worker-etc2, https://172.20.32.90:2380, https://172.20.32.90:2379, false
988bf2ddc2b76293, started, etcd-xout-ops-control-etc2, https://172.20.32.75:2380, https://172.20.32.75:2379, false
b
Odd both nodes have the same name?
r
no, one is named worker instead of control
b
Ohh I see now
r
if I could remove that other node, it might be fine
b
So maybe you have 2 etcd nodes active which means quorum :)
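Before you remove anything I'd double check health/status from that same container, something along these lines:
docker exec etcd etcdctl endpoint status --cluster -w table   # leader, db size, raft term per member
docker exec etcd etcdctl endpoint health --cluster            # whether each member actually answers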
r
root@xout-ops-control-etc2:/# docker exec etcd etcdctl member remove 7e3c56690e6f1ef9
Member 7e3c56690e6f1ef9 removed from cluster 5ef1fc71a30e1d37
let's see... 🙂
getting a different error in the rancher log now:
2023/02/24 07:51:26 [ERROR] Unknown error: Operation cannot be fulfilled on preferences.management.cattle.io "cluster": the object has been modified; please apply your changes to the latest version and try again
b
Is this log from the rancher pods or the cattle cluster agent on the downstream cluster?
r
from the rancher docker container on the VM that hosts rancher itself
I'm gonna delete the 2 "waiting to register" nodes and have them recreated
b
Oh erm, I would avoid using rancher in docker for anything outside playing around/testing, but that point aside, check the cattle cluster agent logs on the downstream cluster. Usually if rancher is having trouble talking to the downstream cluster, that is the place to look
r
isnt that the official way to install anymore?
b
No it's really just a playing around testing thing. We do not officially support any method outside of the helm install
r
but helm install where? 😄
bit of a chicken-and-egg problem
b
Well, you set up k8s on a cluster, then helm install rancher into that. In fact the docker container method is actually a full-blown kubernetes (k3s) cluster with rancher helm-installed onto it, in a docker image
It would be better to just set up a single node k3s yourself and helm install rancher that way. All you need to do to get k3s up and running on a Linux node is
curl -sfL https://get.k3s.io | sh -
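and then rancher on top is roughly this (hostname is just a placeholder, and you'll want cert-manager or your own certs, check the docs for the current chart values):
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
kubectl create namespace cattle-system
helm install rancher rancher-latest/rancher -n cattle-system --set hostname=rancher.example.com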
r
2023/02/24 08:17:15 [ERROR] error syncing 'c-pcd9q/m-h2kbl': handler node-controller: waiting for node to be removed from cluster, requeuing
2023/02/24 08:17:19 [INFO] Deleted cluster node m-h2kbl [xout-ops-control-etc1]
can i get rid of that node somehow?
the VM is long gone
the api object is still there and deleting it does nothing
b
It is gone, but rancher has not received the message that it's gone, so you need to look at the cattle cluster agent and see why it's not syncing with rancher
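If it stays stuck even once the agent is healthy, the last-resort trick is clearing the finalizers on that machine object in the rancher local cluster, I think it's nodes.management.cattle.io, so something like this (ids are the ones from your error log, and only as a last resort since it can leave things orphaned):
kubectl -n c-pcd9q patch nodes.management.cattle.io m-h2kbl --type=merge -p '{"metadata":{"finalizers":[]}}'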
r
what deployment do i need to look at the logs for?
b
Cattle cluster agent. It's a pod in the downstream cluster in the cattle-system namespace
That pod is how rancher talks to the cluster
r
like this? kubectl logs -n cattle-system daemonsets/cattle-node-agent -f
time="2023-02-10T170335Z" level=error msg="Remotedialer proxy error" error="dial tcp <rancher ip>443 connect: connection refused"
ok wtf
b
There should be a cattle cluster agent pod too that might have more detail; you are looking at the node agents
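i.e. something more like this (exact name might differ a bit, kubectl -n cattle-system get pods will show it):
kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=100 -f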
r
the thing is that error makes no sense. I'm checking for network issues right now
b
Just because you are connected to rancher on 443 doesn't mean the cluster can. Is there a firewall between the cluster and rancher by any chance?
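Quick way to check from a node in that cluster, rancher should answer on /ping if I remember right:
curl -sk https://<rancher ip>/ping   # expect "pong" back if the downstream cluster can actually reach rancher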
r
you know what I should have done first?
look at the date 😄
this was 2 weeks ago
anyway, that's not the pod I'm looking for, apparently
cause the last message is from 2 weeks ago
I see a bunch of evicted pods in the cattle-system namespace btw
b
evicted
Yea that's not good at all
Ohh yea, you have no non-tainted nodes right now I take it?
r
Message:          Pod The node had condition: [DiskPressure].
b
Or that
r
they both have 3.2 GB free
b
Yea that's wayyyyy under threshold; disk pressure starts at like 80% used, and the evictions are to protect the node from crashing
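You can see the condition on the node object, and if nobody changed the kubelet defaults the hard-eviction thresholds are around nodefs.available<10% / imagefs.available<15% free:
kubectl describe node <node-name> | grep -i DiskPressure   # goes True once the threshold is crossed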
r
they dont have a lot of disk space in total because I use a vsphere volume driver
oh
well that explains it
so I'll have to give them like 10 GB more disk space
b
We generally recommend around 100G for root per node; /var especially will collect things over time.
r
there's a ton of pods named like k8s_POD_csi-smb-node-t2lfd_kube-system_908ad598-88c9-4ad2-af26-c711a7f39d86_0
sorry not pods, but stopped containers
100 G? hmm I dont have that kind of space 😕
b
Hard to say what exactly is creating those containers but this is what's in them: https://github.com/kubernetes-csi/csi-driver-smb
r
those stopped containers use the image rancher/mirrored-pause:3.6
and there's other stuff that has nothing to do with csi-driver-smb
b
100 is great but I'd really say no less than 50
r
like k8s_POD_rke-ingress-controller-deploy-job-jv5fd_kube-system_dc9c9384-a516-4c45-8294-20f9bb574584_0
b
That's a helm job for installing the ingress controller, that is normal
r
yeah i figure it's for exec'ing stuff
but why are there so many exited pause containers?
shouldn't those be cleaned up when done? like with --rm?
b
Docker has its own garbage collection method, typically separate from k8s and unconcerned with how much disk space you have left on the node
It goes deep; best bet is to allocate a reasonable amount of disk space and let docker do its thing https://docs.docker.com/registry/garbage-collection/
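To see what's actually eating the space before deleting anything, something like:
docker system df        # breakdown per images / containers / local volumes / build cache
docker system prune -f  # drops stopped containers, unused networks and dangling images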
r
well some of those were 7 weeks old, so I'm not holding my breath 🙂
I ran a docker container prune -f && docker image prune -f and it only gave me 150 MB back, so I gotta look for other candidates to get this to < 80%
b
I personally like the tool ncdu for disk cleanup
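e.g. run it scoped to the node's data dirs, something like:
ncdu -x /var   # -x stays on one filesystem; /var/lib/docker and /var/lib/kubelet are the usual suspects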
r
same
hmm why is there a swap image here?
swap is still discouraged in a kubernetes environment, right?
b
Yep swap is not recommended
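if you want to reclaim it, roughly:
swapoff -a   # turn swap off right away
# then comment out or remove the swap line in /etc/fstab so it stays off after a reboot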
r
ok so that's 2 GB I can get rid of, which should put it to something like 70%
ok so swap is gone, disk pressure is gone, and i deleted all the evicted pods as well. cluster is still broken. I think they're trying to reach the defunct node that's being removed
is there any way I can remove that api object from rancher?
I ended up recreating the cluster from flux config. worked quite well. good to have that tested, too 😄
took me a while to purge the old cluster though