red-waitress-37932 (02/24/2023, 4:46 AM)
I'm on Rancher 2.6.8 and it's waiting forever to remove a node from the cluster. About 2 weeks so far. What should I do here?
this is after I deleted most nodes in order to roll out a new VM template

bulky-sunset-52084 (02/24/2023, 7:09 AM)
Looks like etcd lost quorum, so it's in a read-only state. You probably need to restore this cluster from backup. Always add, then remove, one at a time.
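(For reference on the quorum point: etcd can only commit writes while a majority of members, floor(n/2)+1, is reachable. The `quorum` helper below is purely illustrative, just to show the arithmetic.)

```shell
# etcd commits writes only while a majority of members is up:
# quorum(n) = floor(n/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # prints 2: a 3-member cluster survives losing 1 node
quorum 2   # prints 2: a 2-member cluster survives losing none
quorum 1   # prints 1: a single member is its own majority
```

Deleting VMs down to one survivor, without `etcdctl member remove` first, leaves that survivor below quorum and therefore read-only.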

red-waitress-37932 (02/24/2023, 7:09 AM)
oh, so i did break it 😄
can i fix it by removing all but one etcd node?

bulky-sunset-52084 (02/24/2023, 7:12 AM)
Not likely. etcd is in a protected read-only mode right now, and deleting a 'node' in Kubernetes requires a write to etcd to update :/

red-waitress-37932 (02/24/2023, 7:12 AM)
unfortunate

bulky-sunset-52084 (02/24/2023, 7:14 AM)
Yea, etcd can be a bit of a pain. When deleting etcd nodes, always go one at a time. Make sure the replacement is fully up before starting the next one

red-waitress-37932 (02/24/2023, 7:21 AM)
I wanna try to connect to etcd. where can i find the certs and all?

bulky-sunset-52084 (02/24/2023, 7:22 AM)
What k8s distro?

red-waitress-37932 (02/24/2023, 7:23 AM)
i installed it through rancher

bulky-sunset-52084 (02/24/2023, 7:23 AM)
Ok, RKE1, RKE2, or k3s?
They're all gonna be a bit different

red-waitress-37932 (02/24/2023, 7:23 AM)
uhhh, 1 i think?
shouldn't rancher know all this?
I found the ca cert in the cluster object in the api
but not the client cert+key

bulky-sunset-52084 (02/24/2023, 7:27 AM)
So etcd uses mTLS. Rancher talks to the API server within the cluster, and that talks to etcd; rancher doesn't talk to etcd directly

red-waitress-37932 (02/24/2023, 7:27 AM)
ah
so basically it's in some Secret?

bulky-sunset-52084 (02/24/2023, 7:29 AM)
There is a path on the nodes that has both. One sec, gotta look it up. For RKE1 I believe it's in /var/lib/rancher but I could be mistaken
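(As it turns out later in the thread, the easiest route on RKE1 is that the etcd container itself already has etcdctl and its client certs mounted, so you can exec into it rather than hunting for cert files on the host. Container name `etcd` is the RKE1 default; the `count_started` helper is just an illustration for reading the output.)

```shell
# On an RKE1 etcd node, etcdctl runs inside the "etcd" container with
# its client certs already wired up:
#
#   docker exec etcd etcdctl member list
#   docker exec etcd etcdctl endpoint health
#
# member list prints one line per member:
#   <memberID>, started, <name>, <peerURL>, <clientURL>, <isLearner>
# Counting "started" members tells you how many votes are present:
count_started() { grep -c ', started,'; }

# e.g.:  docker exec etcd etcdctl member list | count_started
```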

red-waitress-37932 (02/24/2023, 7:29 AM)
on the etcd node i guess?

bulky-sunset-52084 (02/24/2023, 7:30 AM)
Yep

red-waitress-37932 (02/24/2023, 7:30 AM)
btw, just restarting that last etcd node wouldn't help, right?

bulky-sunset-52084 (02/24/2023, 7:31 AM)
Not likely, but I mean, feel free to try anything

red-waitress-37932 (02/24/2023, 7:31 AM)
probably the opposite
there's only /var/lib/rancher/rke/log

bulky-sunset-52084 (02/24/2023, 7:33 AM)
Well, you can't really hurt anything at this point; the cluster is in a pretty bad state. I'm assuming this, but to prove it out I would look at what the cluster thinks its state is with kubectl get nodes
Maybe under /etc/rancher/ssl then?

red-waitress-37932 (02/24/2023, 7:33 AM)
kubectl get nodes only lists xout-ops-control-etc2 and xout-ops-worker1

bulky-sunset-52084 (02/24/2023, 7:34 AM)
Sorry, on phone now, having trouble validating

red-waitress-37932 (02/24/2023, 7:34 AM)
no /etc/rancher 😕

bulky-sunset-52084 (02/24/2023, 7:35 AM)
If you can kubectl see if you can create or delete any pod on the cluster

red-waitress-37932 (02/24/2023, 7:35 AM)
there is a 12-minute-old pod

bulky-sunset-52084 (02/24/2023, 7:39 AM)
Can you create/delete anything on the cluster? I'm assuming etcd is dead based on one picture.

red-waitress-37932 (02/24/2023, 7:41 AM)
deleting works

bulky-sunset-52084 (02/24/2023, 7:43 AM)
If you can run the member list on the etcd node it should tell you the state of the etcd cluster

red-waitress-37932 (02/24/2023, 7:44 AM)
ah, one of the other nodes is still in

bulky-sunset-52084 (02/24/2023, 7:45 AM)
Deleting working is a good sign. So maybe what is really happening is there's an issue with the cattle cluster agent pod, and it might be worth investigating it

red-waitress-37932 (02/24/2023, 7:45 AM)
root@xout-ops-control-etc2:/# docker exec etcd etcdctl member list
7e3c56690e6f1ef9, started, etcd-xout-ops-worker-etc2, https://172.20.32.90:2380, https://172.20.32.90:2379, false
988bf2ddc2b76293, started, etcd-xout-ops-control-etc2, https://172.20.32.75:2380, https://172.20.32.75:2379, false

bulky-sunset-52084 (02/24/2023, 7:47 AM)
Odd both nodes have the same name?

red-waitress-37932 (02/24/2023, 7:47 AM)
no, one is named worker instead of control

bulky-sunset-52084 (02/24/2023, 7:48 AM)
Ohh I see now

red-waitress-37932 (02/24/2023, 7:48 AM)
if I could remove that other node, it might be fine

bulky-sunset-52084 (02/24/2023, 7:48 AM)
So maybe you have 2 etcd nodes active which means quorum :)

red-waitress-37932 (02/24/2023, 7:49 AM)
root@xout-ops-control-etc2:/# docker exec etcd etcdctl member remove 7e3c56690e6f1ef9
Member 7e3c56690e6f1ef9 removed from cluster 5ef1fc71a30e1d37
let's see... 🙂
getting a different error in the rancher log now:
2023/02/24 07:51:26 [ERROR] Unknown error: Operation cannot be fulfilled on preferences.management.cattle.io "cluster": the object has been modified; please apply your changes to the latest version and try again

bulky-sunset-52084 (02/24/2023, 7:54 AM)
Is this log from the rancher pods or the cattle cluster agent on the downstream cluster?

red-waitress-37932 (02/24/2023, 7:54 AM)
from the rancher docker container on the VM that hosts rancher itself
I'm gonna delete the 2 "waiting to register" nodes and have them recreated

bulky-sunset-52084 (02/24/2023, 7:57 AM)
Oh, erm, I would avoid using rancher in docker for anything outside playing around/testing. But that point aside, check the cattle cluster agent logs on the downstream cluster. Usually if rancher is having trouble talking to the downstream cluster, that is the place to look

red-waitress-37932 (02/24/2023, 7:58 AM)
isn't that the official way to install anymore?

bulky-sunset-52084 (02/24/2023, 7:59 AM)
No, it's really just a playing-around/testing thing. We do not officially support any method outside of the helm install

red-waitress-37932 (02/24/2023, 7:59 AM)
but helm install where? 😄
bit of a chicken-and-egg problem

bulky-sunset-52084 (02/24/2023, 8:01 AM)
Well you set up k8s on a cluster then helm install rancher into that. In fact the docker container method is actually a full blown kubernetes (k3s) cluster with rancher helm installed onto it in a docker image
It would be better to just set up a single node k3s yourself and helm install rancher that way. All you need to do to get k3s up and running on a Linux node is
curl -sfL https://get.k3s.io | sh -
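(A sketch of that supported path end to end. The hostname `rancher.example.com` is a placeholder, and per the Rancher docs the default rancher-generated TLS option also requires cert-manager to be installed first.)

```shell
# 1. single-node k3s on a Linux box
curl -sfL https://get.k3s.io | sh -

# 2. Rancher from the stable helm chart repo
helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
kubectl create namespace cattle-system
helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set replicas=1
```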

red-waitress-37932 (02/24/2023, 8:17 AM)
2023/02/24 08:17:15 [ERROR] error syncing 'c-pcd9q/m-h2kbl': handler node-controller: waiting for node to be removed from cluster, requeuing
2023/02/24 08:17:19 [INFO] Deleted cluster node m-h2kbl [xout-ops-control-etc1]
can i get rid of that node somehow?
the VM is long gone
the api object is still there and deleting it does nothing

bulky-sunset-52084 (02/24/2023, 8:19 AM)
It is gone - but rancher has not received the message that it's gone, so you need to look at the cattle cluster agent and see why it's not syncing with rancher

red-waitress-37932 (02/24/2023, 8:22 AM)
what deployment do i need to look at the logs for?

bulky-sunset-52084 (02/24/2023, 8:23 AM)
Cattle cluster agent. It's a pod in the downstream cluster in the cattle-system namespace
That pod is how rancher talks to the cluster
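(A sketch of where to look. The workload and label names below are the stock ones Rancher creates in cattle-system; the `agent_errors` filter is just an illustration.)

```shell
# The cluster agent is a Deployment in cattle-system (the node agents are a
# separate DaemonSet), so the relevant commands are:
#
#   kubectl -n cattle-system get pods -l app=cattle-cluster-agent
#   kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=200
#
# Connectivity problems usually surface as lines matching this filter:
agent_errors() { grep -iE 'error|refused|certificate|websocket'; }

# e.g.:  kubectl -n cattle-system logs deploy/cattle-cluster-agent | agent_errors
```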

red-waitress-37932 (02/24/2023, 8:24 AM)
like this? kubectl logs -n cattle-system daemonsets/cattle-node-agent -f
time="2023-02-10T17:03:35Z" level=error msg="Remotedialer proxy error" error="dial tcp <rancher ip>:443: connect: connection refused"
ok wtf

bulky-sunset-52084 (02/24/2023, 8:26 AM)
There should be a cattle cluster agent pod too that might have more detail; you are looking at the node agents

red-waitress-37932 (02/24/2023, 8:27 AM)
the thing is that error makes no sense. I'm checking for network issues right now

bulky-sunset-52084 (02/24/2023, 8:28 AM)
Just because you are connected to rancher on 443 doesn't mean the cluster can. Is there a firewall between the cluster and rancher by any chance?

red-waitress-37932 (02/24/2023, 8:32 AM)
you know what I should have done first?
look at the date 😄
this was 2 weeks ago
anyway, that's not the pod I'm looking for, apparently
cause the last message is from 2 weeks ago
I see a bunch of evicted pods in the cattle-system namespace btw

bulky-sunset-52084 (02/24/2023, 8:35 AM)
evicted
Yea that's not good at all
Ohh yea, you have no non-tainted nodes right now I take it?

red-waitress-37932 (02/24/2023, 8:36 AM)
Message:          Pod The node had condition: [DiskPressure].

bulky-sunset-52084 (02/24/2023, 8:36 AM)
Or that

red-waitress-37932 (02/24/2023, 8:36 AM)
they both have 3.2 GB free

bulky-sunset-52084 (02/24/2023, 8:38 AM)
Yea, that's wayyyyy under threshold; disk pressure starts at like 80%. The evictions are to protect the node from crashing
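(For the hard numbers: the upstream kubelet defaults evict when nodefs free space drops below 10%, or imagefs below 15%, so DiskPressure kicks in around 85-90% used rather than at a fixed byte count. A quick way to eyeball a node; the helper names here are made up for this sketch.)

```shell
# Kubelet eviction defaults (upstream): nodefs.available < 10%,
# imagefs.available < 15% -- so roughly 85-90% used triggers DiskPressure.
usage_pct() { df --output=pcent "${1:-/}" | tail -1 | tr -dc '0-9'; }
near_pressure() { [ "${1:-0}" -ge 85 ]; }   # true once usage reaches 85%

u=$(usage_pct /)
if near_pressure "$u"; then
  echo "likely DiskPressure soon: ${u}% used on /"
fi
```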

red-waitress-37932 (02/24/2023, 8:38 AM)
they don't have a lot of disk space in total because I use a vsphere volume driver
oh
well that explains it
so I'll have to give them like 10 GB more disk space

bulky-sunset-52084 (02/24/2023, 8:41 AM)
We generally recommend around 100G for root per node; /var especially will collect things over time.

red-waitress-37932 (02/24/2023, 8:41 AM)
there's a ton of pods named like k8s_POD_csi-smb-node-t2lfd_kube-system_908ad598-88c9-4ad2-af26-c711a7f39d86_0
sorry not pods, but stopped containers
100 G? hmm I don't have that kind of space 😕

bulky-sunset-52084 (02/24/2023, 8:44 AM)
Hard to say what exactly is creating those containers but this is what's in them: https://github.com/kubernetes-csi/csi-driver-smb

red-waitress-37932 (02/24/2023, 8:44 AM)
those stopped containers use the image rancher/mirrored-pause:3.6
and there's other stuff that has nothing to do with csi-driver-smb

bulky-sunset-52084 (02/24/2023, 8:45 AM)
100 is great but I'd really say no less than 50

red-waitress-37932 (02/24/2023, 8:45 AM)
like k8s_POD_rke-ingress-controller-deploy-job-jv5fd_kube-system_dc9c9384-a516-4c45-8294-20f9bb574584_0

bulky-sunset-52084 (02/24/2023, 8:46 AM)
That's a helmjob for installing the ingress controller that is normal

red-waitress-37932 (02/24/2023, 8:47 AM)
yeah i figure it's for exec'ing stuff
but why are there so many exited pause containers?
shouldn't those be cleaned up when done? like with --rm?

bulky-sunset-52084 (02/24/2023, 8:52 AM)
Docker has its own garbage collection method. Typically separate from k8s and unconcerned with how much disk space you have left on the node
It goes deep; best bet is to allocate a reasonable amount of disk space and let docker do its thing https://docs.docker.com/registry/garbage-collection/
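(One detail worth adding: the kubelet also garbage-collects images on each node; by its upstream defaults it starts reclaiming at 85% disk usage (--image-gc-high-threshold) and frees space back down to 80% (--image-gc-low-threshold). The `gc_target_kb` helper is a hypothetical sketch of that arithmetic; the manual equivalents are `docker system df`, `docker container prune`, and `docker image prune`.)

```shell
# Kubelet image GC defaults: start reclaiming at 85% used, stop at 80%.
# How much it tries to free, given current usage and filesystem size:
gc_target_kb() {  # gc_target_kb <used_pct> <total_kb>
  local used=$1 total=$2 high=85 low=80
  if [ "$used" -gt "$high" ]; then
    echo $(( (used - low) * total / 100 ))
  else
    echo 0
  fi
}

gc_target_kb 90 104857600   # 100 GiB disk at 90% used: reclaims ~10 GiB
gc_target_kb 70 104857600   # under threshold: reclaims nothing
```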

red-waitress-37932 (02/24/2023, 8:57 AM)
well some of those were 7 weeks old, so I'm not holding my breath 🙂
I ran a docker container prune -f && docker image prune -f and it only gave me 150 MB back, so I gotta look for other candidates to get this to < 80%

bulky-sunset-52084 (02/24/2023, 8:58 AM)
I personally like the tool ncdu for disk cleanup

red-waitress-37932 (02/24/2023, 8:58 AM)
same
hmm why is there a swap image here?
swap is still discouraged in a kubernetes environment, right?

bulky-sunset-52084 (02/24/2023, 9:00 AM)
Yep swap is not recommended

red-waitress-37932 (02/24/2023, 9:00 AM)
ok so that's 2 GB I can get rid of, which should put it to something like 70%
ok so swap is gone, disk pressure is gone, and i deleted all the evicted pods as well. cluster is still broken. I think they're trying to reach the defunct node that's being removed
is there any way I can remove that api object from rancher?
I ended up recreating the cluster from flux config. worked quite well. good to have that tested, too 😄
took me a while to purge the old cluster though