# rke2
a
Hi everyone, we encountered a strange situation with an RKE2 Rancher-provisioned cluster. We are running in a private datacenter, and the Rancher server was created with a 20-year self-signed certificate. After approximately two months, some VMs were rebooted for maintenance. After that, we found that the CA in the Rancher VM had been replaced by a 60-day certificate. On the other hand, in the master nodes, running the following command:
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n cattle-system logs cattle-cluster-agent-xxxxx-xxx
returned an error indicating that the certificate had expired (the same CA was found in the Rancher VM). The same error appeared in the Rancher agent logs when checking with:
journalctl -xeu rancher-system-agent
My Solution:
1. Replaced the CA in Rancher (the certificate was fine, only the CA was incorrect).
2. This changed the checksum in the Rancher registration cluster, so I replaced it in each master node's cattle-cluster-agent deployment.
3. With that, the cluster was recovered, but the Rancher agent continued showing the error.
4. After searching through the code, I found the file: /var/lib/rancher/agent/rancher2_connection_info.json
5. I updated certificate-authority-data with the base64-encoded correct CA.
6. After making this change on every node, the cluster in Rancher started running correctly, and all nodes appeared green.

Some Questions:
• Does Rancher internally work with 60-day certificates?
• Does my solution introduce security concerns?
• After updating a certificate, the guide Update Rancher Certificate mentions that it is necessary to update the cluster certificate, but it says nothing about the Rancher agent. Could this be a bug? (I don't know if updating a self-signed certificate is a scenario that was considered.)
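Steps 2 and 5 above both derive a value from the CA file. A minimal sketch of how those two values might be computed, assuming the correct CA is available as a local PEM file (the filename ca.pem and the helper names are hypothetical) and that the checksum is the SHA-256 digest of the CA bytes, which is how the Rancher docs describe CATTLE_CA_CHECKSUM. The digest has to be taken over exactly the bytes Rancher serves as its CA, so even a trailing-newline difference would change it.

```shell
#!/bin/sh
# Hypothetical helpers; ca.pem below is an assumed filename for the correct CA.

# Print the SHA-256 hex digest of a CA file
# (the form expected for CATTLE_CA_CHECKSUM).
ca_checksum() {
    sha256sum "$1" | awk '{print $1}'
}

# Print the CA file as single-line base64
# (the form expected for certificate-authority-data).
ca_base64() {
    base64 < "$1" | tr -d '\n'
}
```

Usage would look like `ca_checksum ca.pem` for the deployment value and `ca_base64 ca.pem` for the agent file.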
c
These look like Rancher questions, not rke2. As far as rke2 is concerned rancher is just another app deployed to the cluster. You might share a link to your comment in the #C3ASABBD1 channel. I am confused though, as it is really not clear what CAs you created and where and whether it was for rke2 or rancher.
a
Before creating the cluster mentioned above, I tested updating the self-signed Rancher cert, and the Rancher cert was updated successfully. After that, I updated CATTLE_CA_CHECKSUM in the rke2 cluster following the Rancher guides, and it did not work: the cluster got stuck in updating. Today I think I can confirm this was because the Rancher agent couldn't connect to Rancher to finish the process (I have a pending ticket to test and confirm this). That's why I made the certificate valid for 20 years. Yesterday I found that in the Rancher docker container (after 60 days of working correctly) the mounted CA was not the correct one (I don't know why).

December 27:
• generated the 20-year cert with openssl
• created the Rancher server with docker, mounting the generated cert, key and CA (CA = cert)
• created the rke2 cluster
February 25:
• some alerts about unhealthy pods
February 26:
• found logs in the cattle-cluster-agent pods and the rancher-system-agent service about an expired CA (exactly 60 days after creation)
Yesterday:
• replaced the CA and restarted the Rancher server
• updated ca_checksum in the master nodes; this allowed cluster access, but the nodes in "Cluster Management" got stuck in updating (master) and one in error (worker)
Today:
• replaced certificate-authority-data in the file /var/lib/rancher/agent/rancher2_connection_info.json on all nodes; no more logs about the expired CA

• I don't know why ca.pem was different in the Rancher server (the exactly-60-day cert makes me think it is Rancher behavior, but that is speculation)
• I need to test whether updating the CA in the JSON file allows updating a self-signed cert

I hope this clarifies things for you.
c
We don’t technically support using the rancher/rancher docker container standalone. It is really just for very lightweight dev or proof-of-concept use. For any production env you must deploy Rancher to a Kubernetes cluster using the helm chart.
a
Yep, I know it is not officially supported, but it's not my decision. I found the bug:
docker run -v $PWD/cert:/certs \
  -e SSL_SUBJECT=rancherfqdn \
  -e SSL_DNS=rancherfqdn \
  -e SSL_IP=10.0.0.1 -e SSL_EXPIRE=3650 -e CA_SUBJECT=subject \
  superseb/omgwtfssl
The cert was generated with this command. The generated CA does not respect the SSL_EXPIRE variable; only the cert has the correct expiry date.
In the next few days I will try updating the self-signed cert with the steps mentioned above added; if it works I will report back 👍
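A mismatch like this can be caught right after generation by checking the expiry of every file, not just the leaf cert. A minimal sketch using the standard `openssl x509` options `-enddate` and `-checkend`; the helper names are hypothetical, and the filenames in the usage note (ca.pem, cert.pem) are assumed output names.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting PEM certificate expiry.

# Print the notAfter date of a PEM certificate.
cert_enddate() {
    openssl x509 -in "$1" -noout -enddate | cut -d= -f2
}

# Succeed only if the certificate is still valid N days from now.
# -checkend takes seconds, so days are converted first.
cert_valid_for_days() {
    openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}
```

For example, `cert_enddate ca.pem` and `cert_enddate cert.pem` can be compared side by side, and `cert_valid_for_days ca.pem 3650 || echo "CA expires too early"` would flag a CA that does not cover the intended lifetime.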
c
Sebastiaan is also no longer with SUSE so that SSL helper image also isn't really safe to rely on. I don't think he'd go and delete it but it's definitely not maintained. As a matter of fact I think that image is just a 7-year-old one-time build of a fork of another project.
You've got a lot of stuff going on here that is setting you up for a very difficult-to-support environment.
a
Good to know about Seb, thanks for the help @creamy-pencil-82913 👍
About the reported situation, I can close it; now I understand what happened.
@creamy-pencil-82913 Hi, I can confirm that when you update a self-signed certificate, you also need to update the CA in /var/lib/rancher/agent/rancher2_connection_info.json (because the certificate itself acts as the CA). For private certificates, the CA typically expires in 10-20 years, which might become a problem in the future. Does this warrant opening a ticket?
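The per-node edit could be scripted roughly as follows. This is only a sketch: the function name is hypothetical, and it assumes the value appears in the file in the textual form `certificate-authority-data: <base64>` (for instance inside an embedded kubeconfig string). The file's real layout should be checked before running anything like this.

```shell
#!/bin/sh
# Hypothetical helper: swap the base64 value that follows
# "certificate-authority-data: " in a file, keeping a .bak copy.
# Assumes the field appears in that textual form; verify on a real node first.
replace_ca_data() {
    file=$1
    new_b64=$2
    cp "$file" "$file.bak"
    # Base64 never contains '#', '&' or '\', so '#' is a safe sed delimiter here.
    sed -E "s#(certificate-authority-data: )[A-Za-z0-9+/=]+#\1${new_b64}#" \
        "$file.bak" > "$file"
}
```

On a node this might be called as `replace_ca_data /var/lib/rancher/agent/rancher2_connection_info.json "$(base64 < ca.pem | tr -d '\n')"`, where ca.pem is an assumed name for the correct CA file.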