# elemental
b
I feel like I'm getting gaslighted, but I don't really know where to turn next. I have three installs of Rancher with Elemental installed. Our dev cluster suddenly stopped working while I was PXE booting a test box to it (it had been working fine all week). When I register nodes to it, on the final reboot the elemental-system-agent fails to start and throws this error message endlessly:
```
Oct 24 20:07:10 .node systemd[1]: Started Elemental System Agent.
Oct 24 20:07:10 .node elemental-system-agent[26070]: time="2025-10-24T20:07:10Z" level=info msg="Rancher System Agent version dev (HEAD) is starting"
Oct 24 20:07:10 .node elemental-system-agent[26070]: time="2025-10-24T20:07:10Z" level=info msg="Using directory /var/lib/elemental/agent/work for work"
Oct 24 20:07:10 .node elemental-system-agent[26070]: time="2025-10-24T20:07:10Z" level=info msg="Starting remote watch of plans"
Oct 24 20:07:10 .node elemental-system-agent[26070]: time="2025-10-24T20:07:10Z" level=fatal msg="error while connecting to Kubernetes cluster: Get \"https://dev-rancher.example.edu/k8s/clusters/local/version\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
```
The cert is signed by Let's Encrypt and so are all the other instances.
curl and openssl both work fine on the node and trust the cert that's presented, but for some reason this instance is just super messed up and I can't figure out why.
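(As a rough illustration of the checks being described here, with `dev-rancher.example.edu` standing in for the real hostname; everything else is an assumption:)
```
# Verify TLS from the node against the Rancher URL using the system trust store.
curl -v https://dev-rancher.example.edu/ -o /dev/null

# Show the issuer, subject, and validity dates of the certificate the server presents.
openssl s_client -connect dev-rancher.example.edu:443 -servername dev-rancher.example.edu </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates
```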
There's only one pod in the cattle-elemental-system namespace (the operator) and I restarted it, but I'm not super hopeful that things are going to change.
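(A sketch of restarting that operator; the deployment name `elemental-operator` is an assumption and may differ per install:)
```
# List the operator pod(s) and bounce the deployment.
kubectl -n cattle-elemental-system get pods
kubectl -n cattle-elemental-system rollout restart deployment/elemental-operator
kubectl -n cattle-elemental-system rollout status deployment/elemental-operator
```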
yeah it didn't work.
I guess uninstall and reinstall elemental?
That didn't work either. I'm out of ideas.
w
I had been getting that as well. I don't think the error message is necessarily a direct reflection of the issue.
I'll ask one of my teammates to chime in
@swift-fireman-59958
I think it was also a similar problem we had with Harvester cluster registration
but it was always on the startup after the setup reboot
Is the agent-tls-mode on the Rancher management server set to strict or system-store?
s
In the Rancher helm values, I believe agent-tls-mode should be set to strict if using Rancher self-signed certificates, or system-store if using a well-known authority. Have you checked to see if those are set correctly for your environment? If you are using a custom certificate, then you'll need to add the certificate authority to the system store.
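(A sketch of how to check that on the management cluster; the helm release name/namespace and the Setting resource name are assumptions based on a default Rancher install:)
```
# What was set at install time, assuming Rancher is the "rancher" release in cattle-system.
helm get values rancher -n cattle-system | grep -i agent-tls-mode

# What the running Rancher setting currently resolves to.
kubectl get settings.management.cattle.io agent-tls-mode -o jsonpath='{.default}{" / "}{.value}{"\n"}'
```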
b
It's system-store
It's all signed by Let's Encrypt
s
checks out. Have you rotated certificates for your rancher nodes? You can do that by rebooting each node.
b
To be clear, it's multiple clusters, each signed by Let's Encrypt, and only one of them is throwing this error.
s
I hear you. I would try either rotating certificates in the cluster management view or rebooting the Rancher nodes to rotate them that way, and see if that yields any helpful results.
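(If it helps, a sketch for confirming whether the serving certificate actually changed after a rotation, again using the placeholder hostname:)
```
# Record the fingerprint and expiry of the presented cert; compare before and after rotating.
openssl s_client -connect dev-rancher.example.edu:443 -servername dev-rancher.example.edu </dev/null 2>/dev/null \
  | openssl x509 -noout -fingerprint -sha256 -enddate
```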
b
The other thing I found was that curl (no -k) worked just fine from the console/ssh session.
w
yea I was about to ask about -k
b
Yeah so the system trusts the cert, but for whatever reason the agent doesn't.
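(One way to narrow that down, sketched under the assumption that the system CA bundle is at /etc/ssl/ca-bundle.pem, the SUSE default:)
```
# Confirm the system trust store really validates the chain the server presents.
curl --cacert /etc/ssl/ca-bundle.pem -sv https://dev-rancher.example.edu/ -o /dev/null

# See exactly what the agent unit runs with, in case it is pointed at its own CA bundle
# instead of the system store.
systemctl cat elemental-system-agent
journalctl -u elemental-system-agent --no-pager | tail -n 20
```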
w
what OS/version is it? maybe somehow it's an older OS that doesn't have the updated letsencrypt CAs? (which is certainly a stretch, but I'm grasping at straws if this is working on other clusters outside of this one node)
I imagine the k3s/rke2 cluster is up and running just fine on the node. Maybe a netshoot container will show different behavior if testing curl from there
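(For reference, a sketch of the netshoot idea; the pod name is just an example and assumes a working kubectl context on that cluster:)
```
# Throwaway debug pod; test TLS to the Rancher URL from inside the cluster network.
kubectl run netshoot-debug --rm -it --restart=Never --image=nicolaka/netshoot -- \
  curl -v https://dev-rancher.example.edu/
```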
b
It's the latest versions. We're actually using the same ISO for all the clusters with just different config yaml.
It's public and the certs are very valid everywhere else.
Maybe there's a weird race condition with the agent starting and the system certs being readable?
```
# cat /etc/os-release
NAME="SL-Micro"
VERSION="6.1"
VERSION_ID="6.1"
PRETTY_NAME="SUSE Linux Micro 6.1"
ID="sl-micro"
ID_LIKE="suse sle-micro opensuse-microos microos"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sl-micro:6.1"
HOME_URL="https://www.suse.com/products/micro/"
DOCUMENTATION_URL="https://documentation.suse.com/sl-micro/6.1/"
LOGO="distributor-logo"
IMAGE_REPO="registry.suse.com/suse/sl-micro/6.1/baremetal-os-container"
IMAGE_TAG="2.2.0-4.4"
IMAGE="registry.suse.com/suse/sl-micro/6.1/baremetal-os-container:2.2.0-4.4"
TIMESTAMP=20250211173134
GRUB_ENTRY_NAME="SUSE Linux Micro"
```
It was working fine for a long time then suddenly broke for this one cluster over and over again.
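(On the race-condition theory above, a purely hypothetical workaround sketch rather than a confirmed fix; the unit name comes from the logs, everything else is an assumption:)
```
# Make the agent wait for network-online and retry on failure instead of dying immediately.
mkdir -p /etc/systemd/system/elemental-system-agent.service.d
cat > /etc/systemd/system/elemental-system-agent.service.d/override.conf <<'EOF'
[Unit]
After=network-online.target
Wants=network-online.target

[Service]
Restart=on-failure
RestartSec=10
EOF
systemctl daemon-reload
systemctl restart elemental-system-agent
```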
s
Curious, maybe you've stated this and I just missed it, have you tried with a freshly built seed image ISO?
b
I hadn't because we're doing PXE and it works with all the other registration endpoints.
s
Gotcha, I only ask as a troubleshooting step
b
fwiw I have call with Support tomorrow morning related to this: https://github.com/rancher/elemental/issues/1736
๐Ÿ‘ 1
s
if you have the ability to try it and see what it yields.
b
It just takes like an hour. 😅
s
heh, I feel you. I've done my fair share of hours on hours debugging lately lol
b
And it's not like it was never working... it was... then it stopped. And continues to work in other (almost identical) clusters
s
right, that is frustrating for sure.
I am interested, if the support call is successful, I'd like to hear what the solution is.
b
For the install hooks?
s
ah I misunderstood, I thought you had a support call regarding this as well
that's my bad
b
Ah, yeah no. That other bug affects prod and this just stopped in a temp dev environment.
s
Just for sanity's sake, are the NTP settings in sync between the two clusters?
like do they share the same UTC time? I had a similar issue in an environment recently and that was the fix.
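(A quick sketch of verifying that on each node; assumes a systemd-based system with timedatectl, and chronyc only if chrony is the NTP client:)
```
# Compare wall-clock time and sync status between the nodes.
timedatectl status
date -u

# If chrony is the NTP client, show the offset from the configured servers.
chronyc tracking
```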
b
yeah they're all in our data center and they should all be using the org's NTP servers.
s
gotcha. Might be worth verifying just as a sanity check.
b
I'll likely wait until after the call tomorrow. It's very possible that I go back and try again and it'll all just be "working" ™️
s
ha, that's always how it goes
b
job security goes up, our sanity goes down.
s
And honestly, I do hope it's that easy for you 😂
w
The thing I hate about these damn SSL errors is that the logging sucks for them; half the time it's not actually an issue with the cert/chain, and whatever failed just gets collapsed into a generic certificate error message.
b
The thing that threw me is that curl validated it on the node no problem.
s
Yeah, I mean there are many factors that can cause it to fail, but yeah, better logging would be great.