# general
c
what kubernetes distro? what is the specific error you’re getting?
h
v1.26.7+rke2r1
Error from agent nodes: Oct 04 13:49:40 [hostnameremoved] rke2[2949459]: time="2024-10-04T13:49:40-07:00" level=error msg="CA cert validation failed: Get \"https://127.0.0.1:6444/cacerts\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
At several points while attempting to solve this, we used the script referenced in some of the help docs. As it gets toward the end and makes the HTTP request, it gets stuck with a 401 Unauthorized
c
did you update the token in the node config?
h
No
c
https://docs.rke2.io/security/certificates
If a new root CA is required, the rotation will be disruptive. The rke2 certificate rotate-ca --force option must be used, all nodes (servers and agents) will need to be reconfigured to use the new token value, and pods will need to be restarted to trust the new root CA.
Since you’re getting an error about an untrusted CA, I’m assuming you broke the root of trust and the CA hash has changed.
l
Yes, it appears that way
c
that is also called out later on that page
If you used the --force option or changed the root CA, ensure that any nodes that were joined with a secure token are reconfigured to use the new token value, prior to being restarted. The token may be stored in a .env file, systemd unit, or config.yaml, depending on how the node was configured during initial installation.
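For illustration, a minimal sketch of what updating the token on an agent might look like, assuming the default config location and placeholder values (the actual token is whatever the rotation script printed):
Copy code
# /etc/rancher/rke2/config.yaml on an agent (illustrative placeholders)
server: https://my-server-hostname:9345
token: <new token value printed by the rotation script>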
l
I really appreciate it. Just to clarify, we need to run that on the agents as well?
c
any nodes
l
This is valuable information!
c
yes, that's why it's in the docs lol
h
@creamy-pencil-82913 Boris and Aaron are my colleagues thank you for engaging with us on this.
c
everything should go a lot more smoothly if you’re using the same root CA. If you’re changing the root CA then things are much more complicated.
w
When running that on the agents we see this
Copy code
# rke2 certificate rotate-ca --path /var/lib/rancher/rke2
FATA[0000] open /var/lib/rancher/rke2/server/token: no such file or directory
do we need to copy that from the main server?
c
what?
Why are you rotating again?
You need to reconfigure them to update the token. Not rotate again.
Please go sit down and re-read that page
1. Pick a server node to rotate on
2. Run the script to generate new certs
3. Run the rotate-ca command to load the new certs into the datastore
4. Update the token on ALL the nodes to include the new token value that the script printed
5. Restart the service on ALL the nodes, servers first, then agents
Also, that page specifically tells you to use a temp dir to hold the new certs and not overwrite the stuff in /var/lib/rancher/rke2; you should never find yourself running the rotate-ca command against the current data dir
👍 1
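As a rough sketch of those steps, using the commands quoted from the docs later in this thread and assuming the default systemd unit names:
Copy code
# 1-3. On one server: generate new cross-signed certs in a temp dir, then load them into the datastore
curl -sL https://github.com/k3s-io/k3s/raw/master/contrib/util/rotate-default-ca-certs.sh | PRODUCT=rke2 bash -
rke2 certificate rotate-ca --path=/var/lib/rancher/rke2/server/rotate-ca

# 4. On ALL nodes: update the token (e.g. in /etc/rancher/rke2/config.yaml) to the value the script printed

# 5. Restart the service on ALL nodes, servers first, then agents
systemctl restart rke2-server   # on server nodes
systemctl restart rke2-agent    # on agent nodes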
w
Sorry, I've read the page several times. We did verify that the token in the agents' /etc/rancher/rke2/config.yaml matches the end portion of the value on the server in /var/lib/rancher/rke2/server/token
when we try to run ANY rke2 certificate rotate command on an agent it gives this error
Copy code
# rke2 certificate rotate
FATA[0000] open /var/lib/rancher/rke2/server/token: no such file or directory
That's why we thought it was only run on the server and not the agents
our server starts, our agents won't
we rotated because our certs were expired.
c
v1.26 is pretty old
w
I was wondering about updating...
l
hahaha
it is pretty old
c
you don’t need to do a rotate. You just need to update the token and then restart.
There was a bug where certificate rotate would fail on agents because it was trying to rotate files that only exist on the server, but that is long fixed
l
I updated the token in config.yaml on the agent node, which leads me to a question. Do I use the node-token in the rke2-server?
c
You’re not even on the last patch for 1.26.x, you might try at least getting on the latest patch release for that minor.
❤️ 1
l
sorry it has been a long day
c
If you didn’t start the server with a --agent-token value then there is not a separate agent-only token, you’d just want to use the server token for joining both servers and agents.
they’re all the same thing if you didn’t set up a separate agent token value
Copy code
root@rke2-server-1:/# ls -la /var/lib/rancher/rke2/server/*token 
lrwxrwxrwx 1 root root  34 Oct  4 20:23 /var/lib/rancher/rke2/server/agent-token -> /var/lib/rancher/rke2/server/token
lrwxrwxrwx 1 root root  34 Oct  4 20:23 /var/lib/rancher/rke2/server/node-token -> /var/lib/rancher/rke2/server/token
-rw------- 1 root root 109 Oct  4 20:23 /var/lib/rancher/rke2/server/token
l
I saw that 🙂. I see some people use just the last part of the token (first:middle:last)
is that correct, or should I use the entire length of the token?
c
We haven’t migrated this content over to the rke2 docs yet, but you should read https://docs.k3s.io/cli/token#token-format
since it’s the secure token TLS bootstrapping process that you’re running into problems with
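Roughly, per that k3s page, the full secure token embeds a hash of the cluster CA certificate, which is why an agent holding a token with the old CA hash fails bootstrap after the root CA changes; a sketch of the two formats:
Copy code
# Full ("secure") format: K10 prefix, hash of the cluster CA cert, then credentials
K10<CA-CERT-HASH>::<USERNAME>:<PASSWORD>
# Short format: just the password portion; skips the CA hash check during bootstrap
<PASSWORD>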
w
ya, after validating we're still getting the same error
Copy code
Oct 04 14:52:03  rke2[3223401]: time="2024-10-04T14:52:03-07:00" level=error msg="CA cert validation failed: Get \"https://127.0.0.1:6444/cacerts\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
we updated the agent nodes to use the entire token and restarted it, but it gives the same error still
We think it may actually have something to do with the rke2-ingress-nginx-controller terminating SSL with a different cert
c
it shouldn’t, no. ingress has nothing to do with the connection to the supervisor or apiserver. It is not in that path in any way.
👍 1
Have you restarted all the servers already?
On the agent, what do you get from:
Copy code
curl -ks https://SERVER:9345/cacerts | openssl x509 -noout -text
echo QUIT | openssl s_client -connect SERVER:9345 | openssl x509 -noout -text
where SERVER is the host you’re using as the server: address in the agent config
l
you mean the actual servers correct?
not just the agent service?
for restart?
c
by restart the servers, I mean restart the rke2-server service on the server nodes
you know which nodes are servers and which are agents right?
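Concretely, a hedged sketch of that restart, assuming the standard unit names:
Copy code
# On each server node first
systemctl restart rke2-server
# Then on each agent node
systemctl restart rke2-agent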
h
you know which nodes are servers and which are agents right?
Yes, we are good with this one...
w
Copy code
$ echo QUIT | openssl s_client -connect `hostname`:9345 | openssl x509 -noout -text
depth=1 CN = rke2-server-ca@1727823995
Copy code
$ curl -ks https://`hostname`:9345/cacerts | openssl x509 -noout -text
        Subject: CN = rke2-server-ca@1728070064
l
we only have one server node and the rest are agent nodes
c
Copy code
Issuer: CN = rke2-server-ca@1727823995
        Validity
            Not Before: Oct  1 23:06:35 2024 GMT
            Not After : Oct  2 00:12:11 2025 GMT
        Subject: O = rke2, CN = rke2


        Issuer: CN = rke2-server-ca@1728070064
        Validity
            Not Before: Oct  4 19:27:44 2024 GMT
            Not After : Oct  2 19:27:44 2034 GMT
        Subject: CN = rke2-server-ca@1728070064
The server cert isn’t signed by the new cluster CA, it’s still signed by the old one.
w
that's why I thought it was the loadbalancer
i.e. rke2-ingress-controller
l
earlier, when I ran the script to rotate, it ran perfectly, except when I ran rke2 certificate rotate-ca --path=/home/mydir/oct4certs/rotate-ca --force
c
the ingress runs on ports 80 and 443. the supervisor and apiserver are on 9345 and 6443. They are completely unrelated.
🙌 1
👍 1
Did you put an external load-balancer in front of port 9345?
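If in doubt, a quick way to confirm that the two listeners present different certs (ports as described above; SERVER is a placeholder for your server host), mirroring the openssl commands used earlier in the thread:
Copy code
# Supervisor cert on 9345, signed by the cluster CA
echo QUIT | openssl s_client -connect SERVER:9345 2>/dev/null | openssl x509 -noout -subject -issuer
# Ingress-terminated cert on 443, unrelated to the supervisor CA
echo QUIT | openssl s_client -connect SERVER:443 2>/dev/null | openssl x509 -noout -subject -issuer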
l
it gave me a 401 error for the IP:port/cacerts address
w
I don't think we placed anything in front
c
So you did the rotate-ca, did you restart the rke2-server service after that?
w
yes
several times
h
We will restart it again as we follow this path explicitly. Just to be on the safe side.
c
ok. This may be something that we fixed later on, the version you’re on is pretty old. But try doing this on the server:
Copy code
rm /var/lib/rancher/rke2/server/tls/dynamic-cert.json; kubectl delete secret -n kube-system rke2-serving
then restart the rke2-server service
then run those two openssl commands again and see if the CAs match
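Put together, a sketch of that sequence on the server, reusing the openssl checks from earlier (hostname and port assumed to be the defaults):
Copy code
# Drop the cached serving cert so it is re-issued from the CA stored in the datastore
rm /var/lib/rancher/rke2/server/tls/dynamic-cert.json
kubectl delete secret -n kube-system rke2-serving
systemctl restart rke2-server

# Compare the cert presented on 9345 against the cluster CA served at /cacerts
echo QUIT | openssl s_client -connect `hostname`:9345 | openssl x509 -noout -issuer
curl -ks https://`hostname`:9345/cacerts | openssl x509 -noout -subject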
l
thank you for that. The initial command worked like a charm, but when I tried restarting the server service it threw the following:
level=fatal msg="/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt, /var/lib/rancher/rke2/server/tls/service.key, /var/lib/rancher/rke2/server/tls/etcd/peer-ca.key, /var/lib/rancher/rke2/server/tls/server-ca.crt, /var/lib/rancher/rke2/server/tls/server-ca.key, /var/lib/rancher/rke2/server/tls/client-ca.crt, /var/lib/rancher/rke2/server/tls/client-ca.key, /var/lib/rancher/rke2/server/tls/etcd/server-ca.key, /var/lib/rancher/rke2/server/tls/request-header-ca.crt, /var/lib/rancher/rke2/server/tls/request-header-ca.key, /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt newer than datastore and could cause a cluster outage. Remove the file(s) from disk and restart to be recreated from datastore."
it failed the restart and I found that error in /var/log
c
yeah, that's because you ran the script pointed at the existing data dir and overwrote the files
l
ohh
c
you’re supposed to generate them in a temp dir and then run the rotate-ca command to load them in
so now you gotta go clean those files up and let it extract them from the datastore again
l
ok, could I solve that by removing the tls dir?
c
just delete them and restart it, it should be ok
l
ok, that's what I thought
I appreciate that!
c
maybe just rename the tls dir instead of deleting it
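A minimal sketch of that cleanup, assuming the default data dir (the backup name is arbitrary):
Copy code
# Move the overwritten certs aside so rke2 re-extracts the rotated ones from the datastore
mv /var/lib/rancher/rke2/server/tls /var/lib/rancher/rke2/server/tls.bak
systemctl restart rke2-server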
w
Per https://docs.rke2.io/security/certificates should we try this?
Copy code
# Create updated CA certs and keys, cross-signed by the current CAs.
# This script will create a new temporary directory containing the updated certs, and output the new token values.
curl -sL https://github.com/k3s-io/k3s/raw/master/contrib/util/rotate-default-ca-certs.sh | PRODUCT=rke2 bash -

# Load the updated certs into the datastore; see the script output for the updated token values.
rke2 certificate rotate-ca --path=/var/lib/rancher/rke2/server/rotate-ca
c
… isn't that what you already did?
that's how you got here in the first place, right? https://rancher-users.slack.com/archives/C3ASABBD1/p1728075981005419?thread_ts=1728073415.718259&cid=C3ASABBD1 That’s the same script
Just move that dir out of the way so it can re-extract the certs that you already updated from the datastore out to disk again.