# kubernetes
c
The rancher-system-agent logs are very noisy. You will see errors about missing certs until rke2/k3s starts up and creates them. Look for other errors, and also check the rke2 logs for errors (assuming it's getting as far as trying to start it).
f
Ok, noted. It's not getting as far as rke2; in Rancher it's stuck saying "error applying plan..." a few seconds after the node registers. I'll keep digging and dig deeper into the logs. Was hoping it was an easy fix...
c
If you want to grab the full logs from the rancher-system-agent journald unit and dump 'em somewhere, I can take a look when I have a moment.
f
Couldn't get security to sign off on exporting my logs unfortunately, but I think I'm getting somewhere. I think I'm ballsing this up conceptually... I've got a management cluster with Rancher, managing a number of downstream clusters, and we deploy everything downstream with Terraform and the rancher provider. Conceptually we do the whole cattle-not-pets thing: when a cluster goes funky, we just destroy and rebuild either a node or sometimes the whole cluster from code, exactly identical. The issue we're having is that rancher-system-agent isn't pulling certs right or something. The TLS directory is nonexistent, and the machine plan for rebuilding the same cluster doesn't have any of the TLS folders, rke2 binaries, etc. However, if I create a brand new, never-before-seen cluster, it works fine. So, I get that no one can give me specific advice without logs. But conceptually, am I shooting myself in the foot by destroying and recreating clusters with the same name/details/config? Should we be protecting the cluster resource once created?
c
RKE2 itself doesn’t care about the rancher-side cluster. The TLS directories, binaries, etc. all come from the RKE2 installer, not from the rancher-system-agent. All the rancher-system-agent does is pull the RKE2 install script out of another image (system-agent-installer-rke2), dump it to disk, and run it.
rancher-system-agent also manages the rke2 config, of course, but that happens separately from installing RKE2 itself.
It sounds to me like RKE2 is failing to start, and I would probably look at why. Maybe Rancher is trying to configure it to join an existing cluster for some reason?
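A quick way to check how far that flow actually got on the node is to look for the artifacts each stage leaves on disk. This is just a sketch; the paths below are the usual defaults for rancher-system-agent and rke2, so adjust if your install differs:

```shell
# Check the on-disk artifacts of each provisioning stage, in rough order.
# Paths are common defaults (assumption), not guaranteed for every setup.
for p in \
  /etc/rancher/agent/config.yaml \
  /etc/rancher/rke2/config.yaml.d \
  /usr/local/bin/rke2 \
  /var/lib/rancher/rke2/server/tls; do
  if [ -e "$p" ]; then
    echo "present: $p"
  else
    echo "missing: $p"
  fi
done
```

If the agent config is present but the rke2 binary is missing, the install script never ran (or failed); if the binary exists but the tls dir doesn't, rke2 itself isn't starting.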
f
Hmm... interesting 🤔. Not sure then. The rancher-system-agent logs were full of errors from DoProbe, probe, etc., all complaining that the certificates for the Kubernetes components in /var/lib/rancher/rke2/server/tls didn't exist. That's as far as it got: there was no /var/lib/rancher/rke2/bin folder or rke2 binary yet, and it just hung there cycling through the probe errors. I need to understand the provisioning flow better, I think. Out of office for a week, but I appreciate your help; I'll do some more digging and come back if I find anything. Might try to reproduce it in my lab so I can share logs too.
c
If you look through the logs you should see it extracting the rke2 install script, running it, and then starting the rke2 service.
The health checks run in parallel with that, and will be quite noisy until rke2 starts up and creates all the certs it wants to use to perform the health checks.
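Since the probe noise is expected, filtering it out makes the install-related lines easier to spot. A minimal sketch (the unit name is the standard one; the grep keywords are just a guess at what's useful):

```shell
# Drop the expected cert-probe noise, then surface install-related lines.
journalctl -u rancher-system-agent --no-pager \
  | grep -v -i 'probe' \
  | grep -i -E 'install|extract|error|fail' \
  | tail -n 40
```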
f
Ah right, that makes sense. So there's something not happening with that rke2 script then. Gotcha. Really appreciate the input.