# general
r
this is after I deleted most nodes in order to roll out a new VM template
b
Looks like etcd lost quorum, so it's in a read-only state. You probably need to restore this cluster from backup. Always add first, then remove, one node at a time.
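(etcd needs a majority of members to accept writes, e.g. 2 of 3, so pulling two of three etcd nodes at once drops it below quorum and it locks itself read-only)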
r
oh, so i did break it 😄
can i fix it by removing all but one etcd node?
b
Not likely; etcd is in a protected read-only mode right now. Deleting a 'node' in kubernetes requires a write to etcd to update :/
r
unfortunate
b
Yea etcd can be a bit of a pain. When deleting etcd nodes, always go one at a time. Make sure the replacement is fully up before starting on the next one
r
I wanna try to connect to the etcd. where can i find the certs and all?
b
What k8s distro?
r
i installed it through rancher
b
Ok RKE1 rke2 or k3s?
They're all gonna be a bit different
r
uhhh, 1 i think?
shouldnt rancher know all this?
I found the ca cert in the cluster object in the api
but not the client cert+key
b
So etcd uses mTLS. Rancher talks to the API server within the cluster, which talks to etcd; rancher doesn't talk to etcd directly
r
ah
so basically it's in some Secret?
b
There is a path on the nodes that has both, one sec, gotta look it up. For RKE1 I believe it's in /var/lib/rancher but I could be mistaken
r
on the etcd node i guess?
b
Yep
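If it's RKE1 the quickest way in is probably to just exec into the etcd container on that node, since it already has the client certs and env wired up, something like:
docker exec etcd etcdctl member list   # RKE1's etcd container should just be named "etcd"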
r
btw, just restarting that last etcd node wouldnt help, right?
b
Not likely, but I mean, feel free to try anything
r
probably the opposite
there's only /var/lib/rancher/rke/log
b
Well you really can't hurt anything at this point, the cluster is in a pretty bad state. I'm assuming this, but to prove it out I would look at what the cluster thinks its state is with kubectl get nodes
Maybe under /etc/rancher/ssl then?
r
kubectl get nodes only lists xout-ops-control-etc2 and xout-ops-worker1
b
Sorry on phone now having trouble validating
r
no /etc/rancher 😕
b
If you can kubectl see if you can create or delete any pod on the cluster
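e.g. something throwaway like this, just to force a write through to etcd (name/image don't matter):
kubectl run etcd-write-test --image=busybox --restart=Never -- sleep 60   # creating it needs an etcd write
kubectl delete pod etcd-write-test                                        # so does deleting it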
r
there is a 12-minute-old pod
b
Can you create/delete anything on the cluster? I'm assuming etcd is dead based on one picture.
r
deleting works
b
If you can run the member list on the etcd node it should tell you the state of the etcd cluster
r
ah, one of the other nodes is still in
b
Deleting working is a good sign. So maybe what is really happening is there's an issue with the cattle cluster agent pod, and it might be worth investigating it
r
root@xout-ops-control-etc2:/# docker exec etcd etcdctl member list
7e3c56690e6f1ef9, started, etcd-xout-ops-worker-etc2, https://172.20.32.90:2380, https://172.20.32.90:2379, false
988bf2ddc2b76293, started, etcd-xout-ops-control-etc2, https://172.20.32.75:2380, https://172.20.32.75:2379, false
b
Odd both nodes have the same name?
r
no, one is named worker instead of control
b
Ohh I see now
r
if I could remove that other node, it might be fine
b
So maybe you have 2 etcd nodes active which means quorum :)
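Before you remove anything I'd double check health/status from that same container, something along these lines:
docker exec etcd etcdctl endpoint status --cluster -w table   # leader, db size, raft term per member
docker exec etcd etcdctl endpoint health --cluster            # whether each member actually answers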
r
root@xout-ops-control-etc2:/# docker exec etcd etcdctl member remove 7e3c56690e6f1ef9
Member 7e3c56690e6f1ef9 removed from cluster 5ef1fc71a30e1d37
let's see... 🙂
getting a different error in the rancher log now:
2023/02/24 07:51:26 [ERROR] Unknown error: Operation cannot be fulfilled on preferences.management.cattle.io "cluster": the object has been modified; please apply your changes to the latest version and try again
b
Is this log from the rancher pods or the cattle cluster agent on the downstream cluster?
r
from the rancher docker container on the VM that hosts rancher itself
I'm gonna delete the 2 "waiting to register" nodes and have them recreated
b
Oh erm, I would avoid using rancher in docker for anything outside playing around/testing, but that point aside, check the cattle cluster agent logs on the downstream cluster. Usually if rancher is having trouble talking to the downstream cluster, that is the place to look
r
isnt that the official way to install anymore?
b
No it's really just a playing around testing thing. We do not officially support any method outside of the helm install
r
but helm install where? 😄
bit of a chicken-and-egg problem
b
Well, you set up k8s on a cluster, then helm install rancher into that. In fact the docker container method is actually a full-blown kubernetes (k3s) cluster with rancher helm-installed onto it, in a docker image
It would be better to just set up a single node k3s yourself and helm install rancher that way. All you need to do to get k3s up and running on a Linux node is
curl -sfL https://get.k3s.io | sh -
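and then rancher on top is roughly this (hostname is just a placeholder, and you'll want cert-manager or your own certs, check the docs for the current chart values):
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
kubectl create namespace cattle-system
helm install rancher rancher-latest/rancher -n cattle-system --set hostname=rancher.example.com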
r
2023/02/24 08:17:15 [ERROR] error syncing 'c-pcd9q/m-h2kbl': handler node-controller: waiting for node to be removed from cluster, requeuing
2023/02/24 08:17:19 [INFO] Deleted cluster node m-h2kbl [xout-ops-control-etc1]
can i get rid of that node somehow?
the VM is long gone
the api object is still there and deleting it does nothing
b
It is gone, but rancher has not received the message that it's gone, so you need to look at the cattle cluster agent and see why it's not syncing with rancher
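If it stays stuck even once the agent is healthy, the last-resort trick is clearing the finalizers on that machine object in the rancher local cluster, I think it's nodes.management.cattle.io, so something like this (ids are the ones from your error log, and only as a last resort since it can leave things orphaned):
kubectl -n c-pcd9q patch nodes.management.cattle.io m-h2kbl --type=merge -p '{"metadata":{"finalizers":[]}}'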
r
what deployment do i need to look at the logs for?
b
Cattle cluster agent. It's a pod in the downstream cluster in the cattle-system namespace
That pod is how rancher talks to the cluster
r
like this? kubectl logs -n cattle-system daemonsets/cattle-node-agent -f
time="2023-02-10T170335Z" level=error msg="Remotedialer proxy error" error="dial tcp <rancher ip>443 connect: connection refused"
ok wtf
b
There should be a cattle cluster agent pod too that might have more detail; you are looking at the node agents
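i.e. something more like this (exact name might differ a bit, kubectl -n cattle-system get pods will show it):
kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=100 -f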
r
the thing is that error makes no sense. I'm checking for network issues right now
b
Just because you are connected to rancher on 443 doesn't mean the cluster can. Is there a firewall between the cluster and rancher by any chance?
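Quick way to check from a node in that cluster, rancher should answer on /ping if I remember right:
curl -sk https://<rancher ip>/ping   # expect "pong" back if the downstream cluster can actually reach rancher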
r
you know what I should have done first?
look at the date 😄
this was 2 weeks ago
anyway, that's not the pod I'm looking for, apparently
cause the last message is from 2 weeks ago
I see a bunch of evicted pods in the cattle-system namespace btw
b
evicted
Yea that's not good at all
Ohh yea, you have no non-tainted nodes right now I take it?
r
Message:          Pod The node had condition: [DiskPressure].
b
Or that
r
they both have 3.2 GB free
b
Yea that's wayyyyy under threshold; disk pressure starts at like 80% used, and the evictions are to protect the node from crashing
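You can see the condition on the node object, and if nobody changed the kubelet defaults the hard-eviction thresholds are around nodefs.available<10% / imagefs.available<15% free:
kubectl describe node <node-name> | grep -i DiskPressure   # goes True once the threshold is crossed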
r
they dont have a lot of disk space in total because I use a vsphere volume driver
oh
well that explains it
so I'll have to give them like 10 GB more disk space
b
We generally recommend around 100G for root per node; /var especially will collect things over time.
r
there's a ton of pods named like k8s_POD_csi-smb-node-t2lfd_kube-system_908ad598-88c9-4ad2-af26-c711a7f39d86_0
sorry not pods, but stopped containers
100 G? hmm I dont have that kind of space 😕
b
Hard to say what exactly is creating those containers but this is what's in them: https://github.com/kubernetes-csi/csi-driver-smb
r
those stopped containers use the image rancher/mirrored-pause:3.6
and there's other stuff that has nothing to do with csi-driver-smb
b
100 is great but I'd really say no less than 50
r
like k8s_POD_rke-ingress-controller-deploy-job-jv5fd_kube-system_dc9c9384-a516-4c45-8294-20f9bb574584_0
b
That's a helm job for installing the ingress controller, that is normal
r
yeah i figure it's for exec'ing stuff
but why are there so many exited pause containers?
shouldn't those be cleaned up when done? like with --rm?
b
Docker has its own garbage collection method, typically separate from k8s and unconcerned with how much disk space you have left on the node
It goes deep; best bet is to allocate a reasonable amount of disk space and let docker do its thing https://docs.docker.com/registry/garbage-collection/
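To see what's actually eating the space before deleting anything, something like:
docker system df        # breakdown per images / containers / local volumes / build cache
docker system prune -f  # drops stopped containers, unused networks and dangling images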
r
well some of those were 7 weeks old, so I'm not holding my breath 🙂
I ran a docker container prune -f && docker image prune -f and it only gave me 150 MB back, so I gotta look for other candidates to get this to < 80%
b
I personally like the tool ncdu for disk cleanup
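e.g. run it scoped to the node's data dirs, something like:
ncdu -x /var   # -x stays on one filesystem; /var/lib/docker and /var/lib/kubelet are the usual suspects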
r
same
hmm why is there a swap image here?
swap is still discouraged in a kubernetes environment, right?
b
Yep swap is not recommended
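if you want to reclaim it, roughly:
swapoff -a   # turn swap off right away
# then comment out or remove the swap line in /etc/fstab so it stays off after a reboot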
r
ok so that's 2 GB I can get rid of, which should put it to something like 70%
ok so swap is gone, disk pressure is gone, and i deleted all the evicted pods as well. cluster is still broken. I think they're trying to reach the defunct node that's being removed
is there any way I can remove that api object from rancher?
I ended up recreating the cluster from flux config. worked quite well. good to have that tested, too 😄
took me a while to purge the old cluster though