# elemental
b
I feel like I'm getting gaslighted, but I don't really know where to turn next. I have three installs of Rancher with Elemental installed. Our dev cluster suddenly stopped working while I was PXE booting a test box to it (it had been working fine all week). When I register nodes to it, on the final reboot the elemental-system-agent fails to start and throws this error message endlessly:
```
Oct 24 20:07:10 .node systemd[1]: Started Elemental System Agent.
Oct 24 20:07:10 .node elemental-system-agent[26070]: time="2025-10-24T20:07:10Z" level=info msg="Rancher System Agent version dev (HEAD) is starting"
Oct 24 20:07:10 .node elemental-system-agent[26070]: time="2025-10-24T20:07:10Z" level=info msg="Using directory /var/lib/elemental/agent/work for work"
Oct 24 20:07:10 .node elemental-system-agent[26070]: time="2025-10-24T20:07:10Z" level=info msg="Starting remote watch of plans"
Oct 24 20:07:10 .node elemental-system-agent[26070]: time="2025-10-24T20:07:10Z" level=fatal msg="error while connecting to Kubernetes cluster: Get \"https://dev-rancher.example.edu/k8s/clusters/local/version\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
```
The cert is signed by Let's Encrypt and so are all the other instances.
curl and openssl both work fine on the node and trust the cert that's presented, but for some reason this instance is just super messed up and I can't figure out why.
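(As a rough illustration of the checks being described here, with `dev-rancher.example.edu` standing in for the real hostname; everything else is an assumption:)
```
# Verify TLS from the node against the Rancher URL using the system trust store.
curl -v https://dev-rancher.example.edu/ -o /dev/null

# Show the issuer, subject, and validity dates of the certificate the server presents.
openssl s_client -connect dev-rancher.example.edu:443 -servername dev-rancher.example.edu </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates
```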
There's only one pod in the cattle-elemental-system namespace (the operator) and I restarted it, but I'm not super hopeful that things are going to change.
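(A sketch of restarting that operator; the deployment name `elemental-operator` is an assumption and may differ per install:)
```
# List the operator pod(s) and bounce the deployment.
kubectl -n cattle-elemental-system get pods
kubectl -n cattle-elemental-system rollout restart deployment/elemental-operator
kubectl -n cattle-elemental-system rollout status deployment/elemental-operator
```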
yeah it didn't work.
I guess uninstall and reinstall elemental?
That didn't work either. I'm out of ideas.
w
I had been getting that as well. I don't think the error message is necessarily a direct reflection of the issue.
I'll ask one of my teammates to chime in
@swift-fireman-59958
I think it was also a similar problem we had with Harvester cluster registration
but it was always on the startup after the setup reboot
Is the agent-tls-mode on the Rancher management server set to strict or system-store?
s
In the Rancher helm values, I believe agent-tls-mode should be set to strict if using Rancher self-signed certificates, or system-store if using a well-known authority. Have you checked to see if those are set correctly for your environment? If you are using a custom certificate, then you'll need to add the certificate authority to the system store.
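(A sketch of how to check that on the management cluster; the helm release name/namespace and the Setting resource name are assumptions based on a default Rancher install:)
```
# What was set at install time, assuming Rancher is the "rancher" release in cattle-system.
helm get values rancher -n cattle-system | grep -i agent-tls-mode

# What the running Rancher setting currently resolves to.
kubectl get settings.management.cattle.io agent-tls-mode -o jsonpath='{.default}{" / "}{.value}{"\n"}'
```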
b
It's system-store
It's all signed by Let's Encrypt
s
checks out. Have you rotated certificates for your rancher nodes? You can do that by rebooting each node.
b
To be clear, it's multiple clusters, each signed by Let's Encrypt, and only one of them is throwing this error.
s
I hear you. I would try either rotating certificates in the cluster management view or rebooting the Rancher nodes to rotate them that way, and see if that yields any helpful results.
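(If it helps, a sketch for confirming whether the serving certificate actually changed after a rotation, again using the placeholder hostname:)
```
# Record the fingerprint and expiry of the presented cert; compare before and after rotating.
openssl s_client -connect dev-rancher.example.edu:443 -servername dev-rancher.example.edu </dev/null 2>/dev/null \
  | openssl x509 -noout -fingerprint -sha256 -enddate
```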
b
The other thing I found was that curl (no -k) worked just fine from the console/ssh session.
w
yea I was about to ask about -k
b
Yeah so the system trusts the cert, but for whatever reason the agent doesn't.
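(One way to narrow that down, sketched under the assumption that the system CA bundle is at /etc/ssl/ca-bundle.pem, the SUSE default:)
```
# Confirm the system trust store really validates the chain the server presents.
curl --cacert /etc/ssl/ca-bundle.pem -sv https://dev-rancher.example.edu/ -o /dev/null

# See exactly what the agent unit runs with, in case it is pointed at its own CA bundle
# instead of the system store.
systemctl cat elemental-system-agent
journalctl -u elemental-system-agent --no-pager | tail -n 20
```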
w
what OS/version is it? maybe somehow it's an older OS that doesn't have the updated letsencrypt CAs? (which is certainly a stretch, but I'm grasping at straws if this is working on other clusters outside of this one node)
I imagine the k3s/rke2 cluster is up and running just fine on the node. Maybe a netshoot container will show different behavior if testing curl from there
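(For reference, a sketch of the netshoot idea; the pod name is just an example and assumes a working kubectl context on that cluster:)
```
# Throwaway debug pod; test TLS to the Rancher URL from inside the cluster network.
kubectl run netshoot-debug --rm -it --restart=Never --image=nicolaka/netshoot -- \
  curl -v https://dev-rancher.example.edu/
```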
b
It's the latest versions. We're actually using the same ISO for all the clusters with just different config yaml.
It's public and the certs are very valid everywhere else.
Maybe there's a weird race condition with the agent starting and the system certs being readable?
```
# cat /etc/os-release
NAME="SL-Micro"
VERSION="6.1"
VERSION_ID="6.1"
PRETTY_NAME="SUSE Linux Micro 6.1"
ID="sl-micro"
ID_LIKE="suse sle-micro opensuse-microos microos"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sl-micro:6.1"
HOME_URL="https://www.suse.com/products/micro/"
DOCUMENTATION_URL="https://documentation.suse.com/sl-micro/6.1/"
LOGO="distributor-logo"
IMAGE_REPO="registry.suse.com/suse/sl-micro/6.1/baremetal-os-container"
IMAGE_TAG="2.2.0-4.4"
IMAGE="registry.suse.com/suse/sl-micro/6.1/baremetal-os-container:2.2.0-4.4"
TIMESTAMP=20250211173134
GRUB_ENTRY_NAME="SUSE Linux Micro"
```
It was working fine for a long time then suddenly broke for this one cluster over and over again.
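(On the race-condition theory above, a purely hypothetical workaround sketch rather than a confirmed fix; the unit name comes from the logs, everything else is an assumption:)
```
# Make the agent wait for network-online and retry on failure instead of dying immediately.
mkdir -p /etc/systemd/system/elemental-system-agent.service.d
cat > /etc/systemd/system/elemental-system-agent.service.d/override.conf <<'EOF'
[Unit]
After=network-online.target
Wants=network-online.target

[Service]
Restart=on-failure
RestartSec=10
EOF
systemctl daemon-reload
systemctl restart elemental-system-agent
```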
s
Curious, maybe you've stated this and I just missed it, have you tried with a freshly built seed image ISO?
b
I hadn't because we're doing PXE and it works with all the other registration endpoints.
s
Gotcha, I only ask as a troubleshooting step
b
fwiw I have call with Support tomorrow morning related to this: https://github.com/rancher/elemental/issues/1736
๐Ÿ‘ 1
s
if you have the ability to try it and see what it yields.
b
It just takes like an hour. 😅
s
heh, I feel you. I've done my fair share of hours on hours debugging lately lol
b
And it's not like it was never working... it was... then it stopped. And continues to work in other (almost identical) clusters
s
right, that is frustrating for sure.
I am interested, if the support call is successful, I'd like to hear what the solution is.
b
For the install hooks?
s
ah I misunderstood, I thought you had a support call regarding this as well
that's my bad
b
Ah, yeah no. That other bug affects prod and this just stopped in a temp dev environment.
s
Just for sanity's sake, are the NTP settings in sync between the two clusters?
like do they share the same UTC time? I had a similar issue in an environment recently and that was the fix.
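(A quick sketch of verifying that on each node; assumes a systemd-based system with timedatectl, and chronyc only if chrony is the NTP client:)
```
# Compare wall-clock time and sync status between the nodes.
timedatectl status
date -u

# If chrony is the NTP client, show the offset from the configured servers.
chronyc tracking
```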
b
yeah they're all in our data center and they should all be using the org's NTP servers.
s
gotcha. Might be worth verifying just as a sanity check.
b
I'll likely wait until after the call tomorrow. It's very possible that I go back and try again and it'll all just be "working" ™️
s
ha, that's always how it goes
b
job security goes up, our sanity goes down.
s
And honestly, I do hope it's that easy for you 😂
w
The thing I hate about these damn SSL errors is that the logging sucks for them; half the time it's not actually an issue with the cert/chain, and whatever failed just gets collapsed into a generic certificate error message.
b
The thing that threw me is that curl validated it on the node no problem.
s
Yeah, I mean there are many factors that can cause it to fail, but yeah, better logging would be great.