# general
c
check the etcd and apiserver logs under /var/log/pods to see why they’re not starting?
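(A minimal sketch of that check, assuming the default RKE2 static-pod layout under /var/log/pods; the exact directory names include the node name and a pod UID, so the globs are illustrative:)
```
# list the static-pod log directories for etcd and the apiserver
ls /var/log/pods/ | grep -E 'etcd|kube-apiserver'

# tail the latest container logs; paths follow the
# kube-system_<pod>_<uid>/<container>/<n>.log layout, adjust to what ls shows
tail -n 50 /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log
tail -n 50 /var/log/pods/kube-system_etcd-*/etcd/*.log
```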
a
I’m just seeing a lot of connection refused to various ports on 127.0.0.1. I think that 6443 is the main k8s port but I’m not sure which pod/container is running that. Right now only the etcd container starts and is running, all the rest exited.
c
6443 is the apiserver pod
etcd should come up, then apiserver
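(A quick way to see which of those local ports are actually listening; as above, 6443 is the apiserver, 2379 the etcd client port, 2380 the etcd peer port:)
```
# which control-plane ports are listening locally?
ss -tlnp | grep -E '6443|2379|2380'

# apiserver health endpoint; may return 401/403 depending on auth settings
curl -k https://127.0.0.1:6443/healthz
```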
a
I don’t think it’s even trying to start anything other than etcd now:
The apiserver pod log stops 7 hours ago and ends with:
Which looks to my eyes like it’s failing to connect to whatever should be running locally on 2379 (etcd?)
The etcd service, which is running, has lots of errors about failing to connect to port 2380 on the other two etcd nodes in the cluster, and those ports are closed on those servers
c
ok so why isn’t it running on those?
etcd needs quorum to operate. This node won’t come up until at least one other etcd node comes up to provide a 2/3 majority.
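(For a 3-member cluster, quorum is floor(3/2)+1 = 2, so a single surviving member can’t serve. A hedged sketch of checking member health from a server node, assuming an etcdctl binary is available and the standard RKE2 etcd client cert paths:)
```
# query the local etcd over its client port using RKE2's client certs
export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint health --cluster
```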
a
Ok (thanks so much!) - off to those nodes to check their etcd services!!
Hmm… doesn’t look like they’ve run since 16:28 or even tried to start up. On these nodes (the etcd nodes) there’s no /run/k3s folder to look at running containers under. The rke2-server service shows errors like this in the log: time="2024-11-25T23:20:59Z" level=info msg="Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory\""
c
is containerd failing to start for some reason?
is the rke2-server service enabled and running?
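(A sketch of that check via systemd; the time window is just an example:)
```
# unit state, enablement, and recent logs
systemctl status rke2-server
systemctl is-enabled rke2-server
journalctl -u rke2-server --since "16:00" --no-pager | tail -n 100
```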
a
containerd is running fine - it seems that on the other node there are two containerd sockets: the main /run/containerd/containerd.sock, which has nothing running under it, and /run/k3s/containerd/containerd.sock, which runs the actual workloads
On all nodes rke2-server is not starting correctly
c
check
/var/lib/rancher/rke2/agent/containerd/containerd.log
to see why containerd is failing to start
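(A minimal sketch of reading that log for recent failures:)
```
# last entries, plus anything error-level, in RKE2's embedded containerd log
tail -n 100 /var/lib/rancher/rke2/agent/containerd/containerd.log
grep -iE 'level=(error|fatal)' /var/lib/rancher/rke2/agent/containerd/containerd.log | tail -n 20
```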
a
It seems to be unable to connect to the local etcd endpoint
c
or if it is starting, check the pod logs in /var/log/pods
a
The logs imply it hasn’t been started since 16:28, which is when everything died
There are no errors, just various info messages, and nothing from the hours since, despite the reboots and attempts to start rke2-server
And I see exactly the same logs stop at 16:28 on the other etcd node
c
that looks like a running containerd to me. but you said the socket isn’t there?
according to that log it is listening at /run/k3s/containerd/containerd.sock
a
Those logs are from 16:30 this afternoon, it’s 23:45 here now 😞 I think it was running and listening on that socket then. The logs have no entries after that, so I don’t think it’s being started now.
rke2-server has logs right up until now, but none of the pods do (which makes sense if the RKE2 containerd isn’t running, and it’s not even trying to start it)
c
ok so have you tried starting the service?
a
rke2-server? Yes
It hangs after a systemctl start
c
you don’t need to wait for that command to exit…
just do that with --no-block, or control-c out and go look at the logs
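(Concretely, something like this; --no-block returns immediately instead of waiting for the unit to reach its active state:)
```
# start without waiting, then follow the service log
systemctl start rke2-server --no-block
journalctl -u rke2-server -f
```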
a
The logs from another terminal window
c
you should see it starting containerd, at which point the containerd logs should have new entries
what is that 143.117.208.68 address
is that node up yet?
a
.68 is the control plane node that isn’t starting… seemingly because the etcd nodes aren’t started
That’s the node with only the etcd service on it, hanging until the other etcd nodes come up 😞
c
it should be able to get as far as serving certificates…
you’re on 1.25 though which is quite old
try removing the server line from the config at /etc/rancher/rke2/config.yaml
a
I know it’s quite old… worked until today and on the cards to upgrade soon (perhaps I put it off too long!)
c
remove the server line from the config, and then restart the service
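(A hedged sketch of that edit, assuming a stock /etc/rancher/rke2/config.yaml; backing the file up and commenting the line rather than deleting it, so it can be restored once the other servers are back:)
```
# back up the config, comment out the server: line, restart
cp /etc/rancher/rke2/config.yaml /etc/rancher/rke2/config.yaml.bak
sed -i 's/^server:/#server:/' /etc/rancher/rke2/config.yaml
systemctl restart rke2-server
```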
a
Oooooh, that finished starting and looks like it’s up
And /run/k3s/containerd is there
c
etcd and the apiserver should be up now, then?
a
Yes - all of those are running 🙂
The cluster just says it’s “waiting for cluster agent to connect” on that node now
Still stuck on that, though rancher-system-agent seems healthy: it’s showing active, with no obvious errors in the logs
c
rancher system agent is not the cluster agent, that’s the node agent
a
Ah! That would explain it then.
c
you’d want to look at the cluster agent deployment in the cattle-system namespace
make sure that pod is running
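(A sketch of checking that deployment from a server node, using RKE2’s bundled kubectl and admin kubeconfig; cattle-cluster-agent is the usual deployment name in cattle-system:)
```
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl -n cattle-system get deploy,pods
/var/lib/rancher/rke2/bin/kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=50
```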
a
Using crictl against the k3s containerd? Or ctr directly? (Sorry, I’m very new to containerd)
I can only see a moby namespace with ctr
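(Either works, but it has to be pointed at RKE2’s containerd socket rather than the host one; a hedged sketch, with the socket path taken from the log line earlier and the crictl.yaml path assumed from a default RKE2 install:)
```
# crictl via RKE2's bundled CRI config (points at /run/k3s/containerd/containerd.sock)
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps

# or ctr against the same socket; CRI-managed workloads live in the k8s.io namespace
ctr -a /run/k3s/containerd/containerd.sock -n k8s.io containers list
```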
fleet-agent is a pod running in k3s
In the fleet-agent pod I’m getting DNS errors for the FQDN of the rancher server. DNS is working for this on the system so I guess it’s the internal DNS service that’s not resolving it.
rancher2.qubcloud.uk is the rancher server (as you may guess!)
The coredns pod is running and has some errors about slow event handlers
Ah though it’s now exited
So this seems to be the last piece of the puzzle… the coredns pod keeps exiting with no obvious error
As far as I can tell the issue is DNS resolution within the fleet-agent pod, so the agent cannot register back with Rancher. The coredns containers keep exiting (they complain about the health check taking >1s, but I’m not sure if that’s the cause). The fleet-agent cannot connect via UDP to 10.43.0.10, which is (I believe) the main DNS service of the cluster (and I guess served from the control plane, i.e. from the very coredns pods that are restarting?). I tried to add the IP manually to the hosts file within the fleet-agent pod BUT I don’t have access, as exec runs as a non-root user in the pod. The long delays in the coredns health check shouldn’t be capacity related: the control plane has 32 vCPU and 32 GB of RAM and is barely troubling either.
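(A sketch of testing that DNS path from a node rather than from inside the restricted pod; 10.43.0.10 is the usual RKE2 cluster-DNS service IP, and the k8s-app=kube-dns label is an assumption about the rke2-coredns chart:)
```
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# is coredns running, and what is it logging?
/var/lib/rancher/rke2/bin/kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
/var/lib/rancher/rke2/bin/kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

# query the cluster DNS service IP directly from the node (assumes dig is installed)
dig @10.43.0.10 rancher2.qubcloud.uk
dig @10.43.0.10 kubernetes.default.svc.cluster.local
```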
c
are all three servers running normally now? etcd and apiserver and kube-proxy are all healthy?
a
Yes as far as I can tell all those are fine, running happily
(as in the services are up, not exiting, not complaining and the probes all pass ok in the rancher interface)
I’ve just rebooted to do a clean startup and the only thing I can see in error is the calico-node container
Oh - nope that’s running ok now it seems
Logs look good including in /var/log/pods for calico-node, must have just been restarted during startup
Just that DNS error in fleet-agent 😞
So far, then, I have identified: (1) fleet-agent is failing to get DNS resolution from 10.43.0.10, and (2) the coredns service on the control plane (the same one that fleet-agent is failing against) keeps exiting. I am unsure if this is a network issue (I’ve seen errors in coredns about failing to connect to our internal DNS servers) or just an issue with coredns itself (or a network issue from fleet-agent to coredns, given the connection refused error, though it’s UDP so I guess there’s not much to actually refuse).
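(One way to narrow down why the coredns containers keep exiting, again a hedged sketch using the same assumed label:)
```
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# exit codes / last state of the coredns containers, plus recent cluster events
/var/lib/rancher/rke2/bin/kubectl -n kube-system describe pod -l k8s-app=kube-dns | grep -A 5 'Last State'
/var/lib/rancher/rke2/bin/kubectl -n kube-system get events --sort-by=.lastTimestamp | tail -n 20
```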