# rke2
c
Certs should be renewed on startup whenever they are within 90 days of expiration. Manual rotation should only be necessary if you want to renew them even further out. I'm not aware of any issues with certificates for some components not being renewed, either automatically or when triggered manually. Can you provide more information on what specifically you're seeing? What are the errors you're getting? Do you know which specific files contain the certs that aren't being renewed?
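A quick way to check whether a given cert is inside that 90-day window, assuming openssl is available on the node (the file path is just an example):
```bash
# exit code 1 means the cert expires within the next 90 days (90 * 86400 = 7776000 seconds)
openssl x509 -noout -checkend 7776000 \
  -in /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
  || echo "within 90 days of expiry (or already expired)"
```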
g
That's the impression I've been under too: they should renew themselves. The cluster is unavailable through kubectl, the Rancher UI shows the cluster as updating, and the nodes are reconciling with status "waiting for probes: kube-controller-manager and kube-scheduler". Checking the certs locally on the nodes shows that all the other certs have been renewed except these two. Manually rotating goes through but doesn't renew them.
c
Are there any errors locally, on the node itself? When you're checking the certs locally, which files are you looking at?
g
No, the only errors are that the certs are expired or not valid. I'm checking the files under /var/lib/rancher/rke2/server/tls/
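For reference, a rough sketch for listing the expiry date of every cert under that directory with openssl:
```bash
# print the notAfter date for each cert under the RKE2 server TLS directory
find /var/lib/rancher/rke2/server/tls/ -name '*.crt' | sort | while read -r f; do
  printf '%s  %s\n' "$(openssl x509 -noout -enddate -in "$f")" "$f"
done
```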
c
Right, but where are the errors? In the Rancher UI only, or also in the logs? Can you paste in or screenshot the exact errors?
Also, which specific files in that directory are you checking and have found to be expired?
g
```
Mar  3 08:44:31 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:31Z" level=debug msg="[K8s] Enqueueing after 5.000000 seconds"
Mar  3 08:44:31 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:31Z" level=debug msg="[K8s] secret data/string-data did not change, not updating secret"
Mar  3 08:44:35 commodore3k3 rke2[343307]: time="2023-03-03T08:44:35Z" level=info msg="Connecting to proxy" url="wss://10.100.0.111:9345/v1-rke2/connect"
Mar  3 08:44:35 commodore3k3 rke2[343307]: time="2023-03-03T08:44:35Z" level=error msg="Failed to connect to proxy" error="dial tcp 10.100.0.111:9345: connect: connection refused"
Mar  3 08:44:35 commodore3k3 rke2[343307]: time="2023-03-03T08:44:35Z" level=error msg="Remotedialer proxy error" error="dial tcp 10.100.0.111:9345: connect: connection refused"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] Processing secret custom-c1e7332ad5ff-machine-plan in namespace fleet-default at generation 0 with resource version 187308023"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] Calculated checksum to be 021f9feebc8c223fc0d2045f328b30981147232748d5968cc26ffae7bb40fbd1"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] Remote plan had an applied checksum value of 021f9feebc8c223fc0d2045f328b30981147232748d5968cc26ffae7bb40fbd1"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] Applied checksum was the same as the plan from remote. Not applying."
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] needsApplied was false, not applying"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] writing an applied checksum value of 021f9feebc8c223fc0d2045f328b30981147232748d5968cc26ffae7bb40fbd1 to the remote plan"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] Enqueueing after 5.000000 seconds"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] secret data/string-data did not change, not updating secret"
Mar  3 08:44:40 commodore3k3 rke2[343307]: time="2023-03-03T08:44:40Z" level=info msg="Connecting to proxy" url="wss://10.100.0.111:9345/v1-rke2/connect"
Mar  3 08:44:40 commodore3k3 rke2[343307]: time="2023-03-03T08:44:40Z" level=error msg="Failed to connect to proxy" error="dial tcp 10.100.0.111:9345: connect: connection refused"
Mar  3 08:44:40 commodore3k3 rke2[343307]: time="2023-03-03T08:44:40Z" level=error msg="Remotedialer proxy error" error="dial tcp 10.100.0.111:9345: connect: connection refused"
Mar  3 08:44:41 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:41Z" level=debug msg="[K8s] Processing secret custom-c1e7332ad5ff-machine-plan in namespace fleet-default at generation 0 with resource version 187308023"
```
and these are the expired ones: ../tls/kube-controller-manager/kube-controller-manager.crt and ../tls/kube-scheduler/kube-scheduler.crt
c
I don't see anything in that error log about expired certs. I see that RKE2 on this node is trying to connect to another server at 10.100.0.111. What's the deal with that? Is that node offline at the moment?
Those two certs are not managed by RKE2 itself, they are self-signed certs generated internally by the controller-manager and scheduler pods. I believe those Kubernetes components should regenerate them when the pods restart.
But then I’m not seeing the errors you mentioned, all I see is that there’s a server that can’t be connected to.
g
So there are 3 masters/etcd nodes: 112 is in Running state and has the controller-manager and scheduler certs expired; 111 and 110 are the other masters.
The kube-controller-manager and kube-scheduler have been restarted, but the certs are still expired.
c
Do you have any reason why that node wouldn't be able to connect to 10.100.0.111:9345? That's the only real problem I see here. Those certs, on the other hand, are literally only used by rancher-system-agent to perform health checks and do not matter at all for the functionality of the cluster.
What do the logs on the 111 node say?
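The supervisor logs on that node live in journald, so something along these lines, run on 10.100.0.111 itself, should show them:
```bash
# service status and recent rke2 supervisor logs on the 111 node
systemctl status rke2-server
journalctl -u rke2-server --since "1 hour ago" --no-pager
```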
g
Not a clue. 9345 is the Kubernetes API, but I'm not sure which pod listens on that port. Or is it the kube-controller-manager?
c
No, that is not the Kubernetes API; it is the RKE2 supervisor port. The Kubernetes API is on 6443. Inability to connect to 9345 suggests that either something is blocking the connection, or the rke2-server service on that node is stopped/crashed.
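A minimal connectivity check from the node that logged the refusals, assuming nc is installed (IP and ports taken from the log and the message above):
```bash
# supervisor port on the unreachable server
nc -zv 10.100.0.111 9345
# Kubernetes API on the same server, for comparison
nc -zv 10.100.0.111 6443
```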
For the certs, you can try deleting the expired certs and their corresponding keys, then restarting the two pods. They will generate new ones on startup.
They're just self-signed and should be managed by the respective Kubernetes components; I'm not sure why they would generate them initially but not renew them.
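A rough sketch of that suggestion, assuming the static pods are bounced with RKE2's bundled crictl (the binary and config paths below are the usual RKE2 locations; adjust if yours differ):
```bash
# remove the expired self-signed serving certs and their keys
rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}

# stop the two static pod containers so the kubelet restarts them
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
CRICTL=/var/lib/rancher/rke2/bin/crictl
$CRICTL ps --name kube-controller-manager -q | xargs -r "$CRICTL" stop
$CRICTL ps --name kube-scheduler -q | xargs -r "$CRICTL" stop
```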
g
Ok I'll test
c
you might need to restart rancher-system-agent to get it to pick up the new certs after you do that
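That is just the systemd unit, i.e. on each node:
```bash
# restart the agent so it re-reads the certs it probes against
systemctl restart rancher-system-agent
```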
g
I'll let you know
Done, but it did not create the certificates.
```
2023-03-03T08:45:01.87879211Z stderr F I0303 08:45:01.878607       1 tlsconfig.go:240] Starting DynamicServingCertificateController
2023-03-03T08:45:01.979498892Z stderr F I0303 08:45:01.979265       1 leaderelection.go:243] attempting to acquire leader lease kube-system/kube-scheduler...
2023-03-03T08:45:53.292640882Z stderr F I0303 08:45:53.292296       1 leaderelection.go:253] successfully acquired lease kube-system/kube-scheduler
2023-03-03T10:00:12.395390815Z stderr F E0303 10:00:12.395078       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2023-03-03T10:18:01.879361898Z stderr F E0303 10:18:01.879041       1 dynamic_serving_content.go:165] key failed with : open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory
2023-03-03T10:18:01.884926106Z stderr F E0303 10:18:01.884539       1 dynamic_serving_content.go:165] key failed with : open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory
2023-03-03T10:18:01.8952527Z stderr F E0303 10:18:01.894999       1 dynamic_serving_content.go:165] key failed with : open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory
2023-03-03T10:18:01.915620913Z stderr F E0303 10:18:01.915398       1 dynamic_serving_content.go:165] key failed with : open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory
```
that's the scheduler log now
c
The scheduler is complaining that it can't talk to the local apiserver. Is the apiserver static pod running?
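One way to check from the node itself, again assuming the bundled crictl paths used above; the curl may answer with an auth error rather than "ok" depending on apiserver flags, but a refused or timed-out connection is what would confirm it is down:
```bash
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
# is the kube-apiserver container running on this node?
/var/lib/rancher/rke2/bin/crictl ps --name kube-apiserver
# does anything answer on the local apiserver port at all?
curl -ks https://127.0.0.1:6443/healthz
```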
g
So the certificates had to be recreated manually:
```bash
# generate a new key and CSR (-subj avoids the interactive prompts; the CN here is just a label)
openssl genpkey -algorithm RSA -out kube-controller-manager.key
openssl req -new -key kube-controller-manager.key -subj "/CN=kube-controller-manager" -out kube-controller-manager-request.csr
# sign it with the cluster server CA; the SAN covers localhost/127.0.0.1, which is what the local probes hit
openssl x509 -req -in kube-controller-manager-request.csr -CA server-ca.crt -CAkey server-ca.key -CAcreateserial -out kube-controller-manager.crt -days 3652 -sha256 -extfile <(printf "subjectAltName=DNS:localhost,IP:127.0.0.1")
```
That needs to be run on every master (and likewise for kube-scheduler), and rancher-system-agent needs to be restarted afterwards.
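For completeness, a sketch of putting the regenerated files into place (destination paths as seen earlier in the thread; repeat with the kube-scheduler files on each node):
```bash
# copy the new key and cert to where the component (and the health-check probes) expect them
install -m 600 kube-controller-manager.key /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.key
install -m 644 kube-controller-manager.crt /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt
```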
h
@glamorous-lighter-5580 where did you find the information about that? I think I have the same issue, with "waiting for probes: kube-controller-manager and kube-scheduler", but the cluster seems to work normally otherwise.
c
The certs are just self-signed; you don't need to sign them with the cluster CA.
g
For some reason the self-signed certs don't get re-created by Rancher when rotating certs. It rotates everything else except kube-controller-manager and kube-scheduler, so basically the only option is to generate new certs and replace them.
c
It's because they're not generated by rke2 or Rancher; we just ask the controller-manager and scheduler to create them for themselves, and they do, but then apparently they have no logic to renew them when they expire.
h
Thanks! I did the same procedure for kube-controller-manager and kube-scheduler and it worked.
Just FYI, I had a single cluster where that method did not work, but this method with crictl did work on that one: https://github.com/rancher/rancher/issues/41125#issuecomment-1506620040
Do any of you know how certificate rotation is supposed to work for certs that have not yet expired? I have a cluster where all the certs have been rotated except the kube-controller-manager and kube-scheduler ones, and all the nodes have been rebooted, but they are still going to expire in a few days.
c
As discussed above, the two certs in question are not managed by either Rancher or rke2. They are generated by the core Kubernetes components themselves, but Kubernetes lacks logic to renew them.
In the future Rancher should handle this, since it is responsible for configuring the components to generate certs; it's not something that rke2 does by default.