# kubernetes
c
The rancher-system-agent logs are very noisy. You will see errors about missing certs until rke2/k3s starts up and creates them. Look for other errors, and also check the rke2 logs for errors (assuming it's getting as far as trying to start it).
f
Ok, noted. It's not getting as far as rke2; in Rancher it's stuck saying "error applying plan..." a few seconds after the node registers. I'll keep digging and dig deeper into the logs. Was hoping it was an easy fix...
c
If you want to grab the full logs from the rancher-system-agent journald unit and dump 'em somewhere, I can take a look when I have a moment.
f
Couldn't get security to sign off on exporting my logs unfortunately, but I think I'm getting somewhere. I think I'm ballsing this up conceptually... I've got a management cluster with Rancher, managing a number of downstream clusters, and we deploy everything downstream with Terraform and the rancher provider. Conceptually we do the whole cattle-not-pets thing: when a cluster goes funky, we just destroy and rebuild either a node or sometimes the whole cluster from code, exactly identical. The issue we're having is that rancher-system-agent isn't pulling certs right or something. The TLS directory is nonexistent, and the machine plan for rebuilding the same cluster doesn't have any of the TLS folders, rke2 binaries, etc. However, if I create a brand new, never-before-seen cluster, it works fine. So, I get that no one can give me specific advice without logs. But conceptually, am I shooting myself in the foot by destroying and recreating clusters with the same name/details/config? Should we be protecting the cluster resource once created?
c
RKE2 itself doesn’t care about the rancher-side cluster. The TLS directories, binaries, etc. all come from the RKE2 installer, not from the rancher-system-agent. All the rancher-system-agent does is pull the RKE2 install script out of another image (system-agent-installer-rke2), dump it to disk, and run it.
rancher-system-agent also manages the rke2 config, of course, but that happens separately from installing RKE2 itself.
It sounds to me like RKE2 is failing to start, and I would probably look at why. Maybe Rancher is trying to configure it to join an existing cluster for some reason?
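A quick way to check how far that flow actually got on the node is to look for the artifacts each stage leaves on disk. This is just a sketch; the paths below are the usual defaults for rancher-system-agent and rke2, so adjust if your install differs:

```shell
# Check the on-disk artifacts of each provisioning stage, in rough order.
# Paths are common defaults (assumption), not guaranteed for every setup.
for p in \
  /etc/rancher/agent/config.yaml \
  /etc/rancher/rke2/config.yaml.d \
  /usr/local/bin/rke2 \
  /var/lib/rancher/rke2/server/tls; do
  if [ -e "$p" ]; then
    echo "present: $p"
  else
    echo "missing: $p"
  fi
done
```

If the agent config is present but the rke2 binary is missing, the install script never ran (or failed); if the binary exists but the tls dir doesn't, rke2 itself isn't starting.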
f
Hmm... interesting 🤔. Not sure then. The rancher-system-agent logs were full of errors from DoProbe, probe, etc., all complaining that the certificates for the Kubernetes components in /var/lib/rancher/rke2/server/tls didn't exist. That's as far as it got: there was no /var/lib/rancher/rke2/bin folder or rke2 binary yet, and it just hung there cycling through the probe errors. I need to understand the provisioning flow better, I think. Out of office for a week, but I appreciate your help; I'll do some more digging and come back if I find anything. Might try to reproduce it in my lab so I can share logs too.
c
If you look through the logs you should see it extracting the rke2 install script, running it, and then starting the rke2 service.
The health checks run in parallel with that, and will be quite noisy until rke2 starts up and creates all the certs it wants to use to perform the health checks.
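Since the probe noise is expected, filtering it out makes the install-related lines easier to spot. A minimal sketch (the unit name is the standard one; the grep keywords are just a guess at what's useful):

```shell
# Drop the expected cert-probe noise, then surface install-related lines.
journalctl -u rancher-system-agent --no-pager \
  | grep -v -i 'probe' \
  | grep -i -E 'install|extract|error|fail' \
  | tail -n 40
```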
f
Ah right, that makes sense. So there's something not happening with that rke2 script then. Gotcha. Really appreciate the input.