# general
c
check the etcd and apiserver logs under /var/log/pods to see why they’re not starting?
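(A minimal sketch of that check, assuming the default RKE2 static-pod layout under /var/log/pods; the exact directory names include the node name and a pod UID, so the globs are illustrative:)
```
# list the static-pod log directories for etcd and the apiserver
ls /var/log/pods/ | grep -E 'etcd|kube-apiserver'

# tail the latest container logs; paths follow the
# kube-system_<pod>_<uid>/<container>/<n>.log layout, adjust to what ls shows
tail -n 50 /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log
tail -n 50 /var/log/pods/kube-system_etcd-*/etcd/*.log
```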
a
I’m just seeing a lot of connection refused to various ports on 127.0.0.1. I think that 6443 is the main k8s port but I’m not sure which pod/container is running that. Right now only the etcd container starts and is running, all the rest exited.
c
6443 is the apiserver pod
etcd should come up, then apiserver
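(A quick way to see which of those local ports are actually listening; as above, 6443 is the apiserver, 2379 the etcd client port, 2380 the etcd peer port:)
```
# which control-plane ports are listening locally?
ss -tlnp | grep -E '6443|2379|2380'

# apiserver health endpoint; may return 401/403 depending on auth settings
curl -k https://127.0.0.1:6443/healthz
```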
a
I don’t think it’s even trying to start anything other than etcd now:
The apiserver pod log stops 7 hours ago and ends with:
Which looks to my eyes like it’s failing to connect to whatever should be running locally on 2379 (etcd?)
The etcd service, which is running, has lots of errors about failing to connect to port 2380 on the other two etcd nodes in the cluster, and those ports are closed on those servers
c
ok so why isn’t it running on those?
etcd needs quorum to operate. This node won’t come up until at least one other etcd node comes up to provide a 2/3 majority.
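(For a 3-member cluster, quorum is floor(3/2)+1 = 2, so a single surviving member can’t serve. A hedged sketch of checking member health from a server node, assuming an etcdctl binary is available and the standard RKE2 etcd client cert paths:)
```
# query the local etcd over its client port using RKE2's client certs
export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint health --cluster
```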
a
Ok (thanks so much!) - off to those nodes to check their etcd services!!
Hmm… doesn’t look like they’ve run since 16:28 or even tried to start up. On these nodes (the etcd nodes) there’s no /run/k3s folder to look at running containers under. The rke2-server service shows errors like this in the log: time="2024-11-25T23:20:59Z" level=info msg="Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory\""
c
is containerd failing to start for some reason?
is the rke2-server service enabled and running?
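(A sketch of that check via systemd; the time window is just an example:)
```
# unit state, enablement, and recent logs
systemctl status rke2-server
systemctl is-enabled rke2-server
journalctl -u rke2-server --since "16:00" --no-pager | tail -n 100
```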
a
containerd is running fine - it seems that on the other node there are two containerd sockets: the main /run/containerd/containerd.sock, which has nothing running under it, and /run/k3s/containerd/containerd.sock, which runs the actual workloads
On all nodes rke2-server is not starting correctly
c
check
/var/lib/rancher/rke2/agent/containerd/containerd.log
to see why containerd is failing to start
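(A minimal sketch of reading that log for recent failures:)
```
# last entries, plus anything error-level, in RKE2's embedded containerd log
tail -n 100 /var/lib/rancher/rke2/agent/containerd/containerd.log
grep -iE 'level=(error|fatal)' /var/lib/rancher/rke2/agent/containerd/containerd.log | tail -n 20
```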
a
It seems to be unable to connect to the local etcd endpoint
c
or if it is starting, check the pod logs in /var/log/pods
a
The logs imply it hasn’t been started since 16:28, which is when everything died
There are no errors, just various info messages, and nothing from the hours since, despite the reboots and attempts to start rke2-server
And I see exactly the same logs stop at 16:28 on the other etcd node
c
that looks like a running containerd to me. but you said the socket isn’t there?
according to that log it is listening at /run/k3s/containerd/containerd.sock
a
Those logs are from 16:30 this afternoon, it’s 23:45 here now 😞 I think it was running and listening on that socket then. The logs have no entries after that, so I don’t think it’s being started now.
rke2-server has logs right up until now, but none of the pods do (which makes sense if the RKE2 containerd isn’t running, and it’s not even trying to start it)
c
ok so have you tried starting the service?
a
rke2-server? Yes
It hangs after a systemctl start
c
you don’t need to wait for that command to exit…
just do that with --no-block, or control-c out and go look at the logs
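(Concretely, something like this; --no-block returns immediately instead of waiting for the unit to reach its active state:)
```
# start without waiting, then follow the service log
systemctl start rke2-server --no-block
journalctl -u rke2-server -f
```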
a
The logs from another terminal window
c
you should see it starting containerd, at which point the containerd logs should have new entries
what is that 143.117.208.68 address
is that node up yet?
a
.68 is the control plane node that isn’t starting… seemingly because the etcd nodes aren’t started
That’s the node with only the etcd service on it, hanging until the other etcd nodes come up 😞
c
it should be able to get as far as serving certificates…
you’re on 1.25 though which is quite old
try removing the server line from the config at /etc/rancher/rke2/config.yaml
a
I know it’s quite old… worked until today and on the cards to upgrade soon (perhaps I put it off too long!)
c
remove the server line from the config, and then restart the service
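(A hedged sketch of that edit, assuming a stock /etc/rancher/rke2/config.yaml; backing the file up and commenting the line rather than deleting it, so it can be restored once the other servers are back:)
```
# back up the config, comment out the server: line, restart
cp /etc/rancher/rke2/config.yaml /etc/rancher/rke2/config.yaml.bak
sed -i 's/^server:/#server:/' /etc/rancher/rke2/config.yaml
systemctl restart rke2-server
```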
a
Oooooh, that finished starting and looks like it’s up
And /run/k3s/containerd is there
c
etcd and the apiserver should be up now, then?
a
Yes - all of those are running 🙂
The cluster just says it’s “waiting for cluster agent to connect” on that node now
Still stuck on that, though rancher-system-agent seems healthy: it’s showing active, with no obvious errors in the logs
c
rancher system agent is not the cluster agent, that’s the node agent
a
Ah! That would explain it then.
c
you’d want to look at the cluster agent deployment in the cattle-system namespace
make sure that pod is running
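(A sketch of checking that deployment from a server node, using RKE2’s bundled kubectl and admin kubeconfig; cattle-cluster-agent is the usual deployment name in cattle-system:)
```
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl -n cattle-system get deploy,pods
/var/lib/rancher/rke2/bin/kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=50
```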
a
Using crictl against the k3s containerd? Or ctr directly? (Sorry, I’m very new to containerd)
I can only see a moby namespace with ctr
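(Either works, but it has to be pointed at RKE2’s containerd socket rather than the host one; a hedged sketch, with the socket path taken from the log line earlier and the crictl.yaml path assumed from a default RKE2 install:)
```
# crictl via RKE2's bundled CRI config (points at /run/k3s/containerd/containerd.sock)
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps

# or ctr against the same socket; CRI-managed workloads live in the k8s.io namespace
ctr -a /run/k3s/containerd/containerd.sock -n k8s.io containers list
```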
fleet-agent is a pod running in k3s
In the fleet-agent pod I’m getting DNS errors for the FQDN of the rancher server. DNS is working for this on the system so I guess it’s the internal DNS service that’s not resolving it.
rancher2.qubcloud.uk is the rancher server (as you may guess!)
The coredns pod is running and has some errors about slow event handlers
Ah though it’s now exited
So this seems to be the last piece of the puzzle… the coredns pod keeps exiting with no obvious error
As far as I can tell the issue is DNS resolution within the fleet-agent pod, so the agent cannot register back with Rancher. The coredns containers keep exiting (they complain about the health check taking >1s, but I’m not sure if that’s the cause). The fleet-agent cannot connect via UDP to 10.43.0.10, which is (I believe) the main DNS service of the cluster (and I guess served from the control plane, i.e. from the very coredns pods that are restarting?). I tried to add the IP manually to the hosts file within the fleet-agent pod BUT I don’t have access, as exec runs as a non-root user in the pod. The long delays in the coredns health check shouldn’t be capacity related: the control plane has 32 vCPU and 32 GB of RAM and is barely troubling either.
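(A sketch of testing that DNS path from a node rather than from inside the restricted pod; 10.43.0.10 is the usual RKE2 cluster-DNS service IP, and the k8s-app=kube-dns label is an assumption about the rke2-coredns chart:)
```
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# is coredns running, and what is it logging?
/var/lib/rancher/rke2/bin/kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
/var/lib/rancher/rke2/bin/kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

# query the cluster DNS service IP directly from the node (assumes dig is installed)
dig @10.43.0.10 rancher2.qubcloud.uk
dig @10.43.0.10 kubernetes.default.svc.cluster.local
```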
c
are all three servers running normally now? etcd and apiserver and kube-proxy are all healthy?
a
Yes as far as I can tell all those are fine, running happily
(as in the services are up, not exiting, not complaining and the probes all pass ok in the rancher interface)
I’ve just rebooted to do a clean startup and the only thing I can see in error is the calico-node container
Oh - nope that’s running ok now it seems
Logs look good including in /var/log/pods for calico-node, must have just been restarted during startup
Just that DNS error in fleet-agent 😞
So far, then, I have identified: (1) fleet-agent is failing to get DNS resolution from 10.43.0.10, and (2) the coredns service on the control plane (the same one that fleet-agent is failing against) keeps exiting. I am unsure if this is a network issue (I’ve seen errors in coredns about failing to connect to our internal DNS servers) or just an issue with coredns itself (or a network issue from fleet-agent to coredns, given the connection refused error, though it’s UDP so I guess there’s not much to actually refuse).
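(One way to narrow down why the coredns containers keep exiting, again a hedged sketch using the same assumed label:)
```
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# exit codes / last state of the coredns containers, plus recent cluster events
/var/lib/rancher/rke2/bin/kubectl -n kube-system describe pod -l k8s-app=kube-dns | grep -A 5 'Last State'
/var/lib/rancher/rke2/bin/kubectl -n kube-system get events --sort-by=.lastTimestamp | tail -n 20
```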