# rke2
c
Certs should be renewed on startup whenever they are within 90 days of expiration. Manual rotation should only be necessary if you want to renew them even further out. I'm not aware of any issues with certificates for some components not being renewed, either automatically or when triggered manually. Can you provide more information on what specifically you're seeing? What are the errors you're getting? Do you know which specific files contain the certs that aren't being renewed?
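A quick way to check whether a given cert is inside that 90-day window, assuming openssl is available on the node (the file path is just an example):
```bash
# exit code 1 means the cert expires within the next 90 days (90 * 86400 = 7776000 seconds)
openssl x509 -noout -checkend 7776000 \
  -in /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
  || echo "within 90 days of expiry (or already expired)"
```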
g
That's the impression I've been under too: they should renew themselves. The cluster is unavailable through kubectl, the Rancher UI shows the cluster as updating, and the nodes are reconciling with status "waiting for probes: kube-controller-manager and kube-scheduler". Checking the certs locally on the nodes shows that all the other certs have been renewed except these two. Manually rotating goes through but doesn't renew them.
c
Are there any errors locally, on the node itself? When you're checking the certs locally, which files are you looking at?
g
No, the only errors are that the certs are expired or not valid. I'm checking the files under /var/lib/rancher/rke2/server/tls/
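For reference, a rough sketch for listing the expiry date of every cert under that directory with openssl:
```bash
# print the notAfter date for each cert under the RKE2 server TLS directory
find /var/lib/rancher/rke2/server/tls/ -name '*.crt' | sort | while read -r f; do
  printf '%s  %s\n' "$(openssl x509 -noout -enddate -in "$f")" "$f"
done
```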
c
Right, but where are the errors? In the Rancher UI only, or also in the logs? Can you paste in or screenshot the exact errors?
Also, which specific files in that directory are you checking and have found to be expired?
g
```
Mar  3 08:44:31 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:31Z" level=debug msg="[K8s] Enqueueing after 5.000000 seconds"
Mar  3 08:44:31 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:31Z" level=debug msg="[K8s] secret data/string-data did not change, not updating secret"
Mar  3 08:44:35 commodore3k3 rke2[343307]: time="2023-03-03T08:44:35Z" level=info msg="Connecting to proxy" url="wss://10.100.0.111:9345/v1-rke2/connect"
Mar  3 08:44:35 commodore3k3 rke2[343307]: time="2023-03-03T08:44:35Z" level=error msg="Failed to connect to proxy" error="dial tcp 10.100.0.111:9345: connect: connection refused"
Mar  3 08:44:35 commodore3k3 rke2[343307]: time="2023-03-03T08:44:35Z" level=error msg="Remotedialer proxy error" error="dial tcp 10.100.0.111:9345: connect: connection refused"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] Processing secret custom-c1e7332ad5ff-machine-plan in namespace fleet-default at generation 0 with resource version 187308023"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] Calculated checksum to be 021f9feebc8c223fc0d2045f328b30981147232748d5968cc26ffae7bb40fbd1"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] Remote plan had an applied checksum value of 021f9feebc8c223fc0d2045f328b30981147232748d5968cc26ffae7bb40fbd1"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] Applied checksum was the same as the plan from remote. Not applying."
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] needsApplied was false, not applying"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] writing an applied checksum value of 021f9feebc8c223fc0d2045f328b30981147232748d5968cc26ffae7bb40fbd1 to the remote plan"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] Enqueueing after 5.000000 seconds"
Mar  3 08:44:36 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:36Z" level=debug msg="[K8s] secret data/string-data did not change, not updating secret"
Mar  3 08:44:40 commodore3k3 rke2[343307]: time="2023-03-03T08:44:40Z" level=info msg="Connecting to proxy" url="wss://10.100.0.111:9345/v1-rke2/connect"
Mar  3 08:44:40 commodore3k3 rke2[343307]: time="2023-03-03T08:44:40Z" level=error msg="Failed to connect to proxy" error="dial tcp 10.100.0.111:9345: connect: connection refused"
Mar  3 08:44:40 commodore3k3 rke2[343307]: time="2023-03-03T08:44:40Z" level=error msg="Remotedialer proxy error" error="dial tcp 10.100.0.111:9345: connect: connection refused"
Mar  3 08:44:41 commodore3k3 rancher-system-agent[2146894]: time="2023-03-03T08:44:41Z" level=debug msg="[K8s] Processing secret custom-c1e7332ad5ff-machine-plan in namespace fleet-default at generation 0 with resource version 187308023"
```
and these are the expired ones: ../tls/kube-controller-manager/kube-controller-manager.crt and ../tls/kube-scheduler/kube-scheduler.crt
c
I don't see anything in that error log about expired certs. I see that RKE2 on this node is trying to connect to another server at 10.100.0.111. What's the deal with that? Is that node offline at the moment?
Those two certs are not managed by RKE2 itself, they are self-signed certs generated internally by the controller-manager and scheduler pods. I believe those Kubernetes components should regenerate them when the pods restart.
But then I’m not seeing the errors you mentioned, all I see is that there’s a server that can’t be connected to.
g
So there are 3 masters/etcd nodes: 112 is in Running state and has the controller-manager and scheduler certs expired; 111 and 110 are the other masters.
The kube-controller-manager and kube-scheduler have been restarted, but the certs are still expired.
c
Do you have any reason why that node wouldn't be able to connect to 10.100.0.111:9345? That's the only real problem I see here. Those certs, on the other hand, are literally only used by rancher-system-agent to perform health checks and do not matter at all for the functionality of the cluster.
What do the logs on the 111 node say?
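The supervisor logs on that node live in journald, so something along these lines, run on 10.100.0.111 itself, should show them:
```bash
# service status and recent rke2 supervisor logs on the 111 node
systemctl status rke2-server
journalctl -u rke2-server --since "1 hour ago" --no-pager
```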
g
Not a clue. 9345 is the Kubernetes API, but I'm not sure which pod listens on that port. Or is it the kube-controller-manager?
c
No, that is not the Kubernetes API; it is the RKE2 supervisor port. The Kubernetes API is on 6443. Inability to connect to 9345 suggests that either something is blocking the connection, or the rke2-server service on that node is stopped/crashed.
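A minimal connectivity check from the node that logged the refusals, assuming nc is installed (IP and ports taken from the log and the message above):
```bash
# supervisor port on the unreachable server
nc -zv 10.100.0.111 9345
# Kubernetes API on the same server, for comparison
nc -zv 10.100.0.111 6443
```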
For the certs, you can try deleting the expired certs and their corresponding keys, then restarting the two pods. They will generate new ones on startup.
They're just self-signed and should be managed by the respective Kubernetes components; I'm not sure why they would generate them initially but not renew them.
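A rough sketch of that suggestion, assuming the static pods are bounced with RKE2's bundled crictl (the binary and config paths below are the usual RKE2 locations; adjust if yours differ):
```bash
# remove the expired self-signed serving certs and their keys
rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}

# stop the two static pod containers so the kubelet restarts them
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
CRICTL=/var/lib/rancher/rke2/bin/crictl
$CRICTL ps --name kube-controller-manager -q | xargs -r "$CRICTL" stop
$CRICTL ps --name kube-scheduler -q | xargs -r "$CRICTL" stop
```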
g
Ok I'll test
c
you might need to restart rancher-system-agent to get it to pick up the new certs after you do that
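That is just the systemd unit, i.e. on each node:
```bash
# restart the agent so it re-reads the certs it probes against
systemctl restart rancher-system-agent
```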
g
I'll let you know
Done, but it did not create the certificates.
```
2023-03-03T08:45:01.87879211Z stderr F I0303 08:45:01.878607       1 tlsconfig.go:240] Starting DynamicServingCertificateController
2023-03-03T08:45:01.979498892Z stderr F I0303 08:45:01.979265       1 leaderelection.go:243] attempting to acquire leader lease kube-system/kube-scheduler...
2023-03-03T08:45:53.292640882Z stderr F I0303 08:45:53.292296       1 leaderelection.go:253] successfully acquired lease kube-system/kube-scheduler
2023-03-03T10:00:12.395390815Z stderr F E0303 10:00:12.395078       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2023-03-03T10:18:01.879361898Z stderr F E0303 10:18:01.879041       1 dynamic_serving_content.go:165] key failed with : open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory
2023-03-03T10:18:01.884926106Z stderr F E0303 10:18:01.884539       1 dynamic_serving_content.go:165] key failed with : open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory
2023-03-03T10:18:01.8952527Z stderr F E0303 10:18:01.894999       1 dynamic_serving_content.go:165] key failed with : open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory
2023-03-03T10:18:01.915620913Z stderr F E0303 10:18:01.915398       1 dynamic_serving_content.go:165] key failed with : open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory
```
that's the scheduler log now
c
The scheduler is complaining that it can't talk to the local apiserver. Is the apiserver static pod running?
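One way to check from the node itself, again assuming the bundled crictl paths used above; the curl may answer with an auth error rather than "ok" depending on apiserver flags, but a refused or timed-out connection is what would confirm it is down:
```bash
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
# is the kube-apiserver container running on this node?
/var/lib/rancher/rke2/bin/crictl ps --name kube-apiserver
# does anything answer on the local apiserver port at all?
curl -ks https://127.0.0.1:6443/healthz
```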
g
So the certificates had to be recreated manually:
```bash
# generate a new key and CSR (-subj avoids the interactive prompts; the CN here is just a label)
openssl genpkey -algorithm RSA -out kube-controller-manager.key
openssl req -new -key kube-controller-manager.key -subj "/CN=kube-controller-manager" -out kube-controller-manager-request.csr
# sign it with the cluster server CA; the SAN covers localhost/127.0.0.1, which is what the local probes hit
openssl x509 -req -in kube-controller-manager-request.csr -CA server-ca.crt -CAkey server-ca.key -CAcreateserial -out kube-controller-manager.crt -days 3652 -sha256 -extfile <(printf "subjectAltName=DNS:localhost,IP:127.0.0.1")
```
That needs to be run on every master (and likewise for kube-scheduler), and rancher-system-agent needs to be restarted afterwards.
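For completeness, a sketch of putting the regenerated files into place (destination paths as seen earlier in the thread; repeat with the kube-scheduler files on each node):
```bash
# copy the new key and cert to where the component (and the health-check probes) expect them
install -m 600 kube-controller-manager.key /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.key
install -m 644 kube-controller-manager.crt /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt
```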
h
@glamorous-lighter-5580 where did you find the information about that? I think I have the same issue, with "waiting for probes: kube-controller-manager and kube-scheduler", but the cluster seems to work normally otherwise.
c
The certs are just self-signed; you don't need to sign them with the cluster CA.
g
For some reason the self-signed certs don't get re-created by Rancher when rotating certs. It rotates everything else except kube-controller-manager and kube-scheduler, so basically the only option is to generate new certs and replace them.
c
It's because they're not generated by rke2 or Rancher; we just ask the controller-manager and scheduler to create them for themselves, and they do, but then apparently they have no logic to renew them when they expire.
h
Thanks! I did the same procedure for kube-controller-manager and kube-scheduler and it worked.
Just FYI, I had a single cluster where that method did not work, but this method with crictl did work on that one: https://github.com/rancher/rancher/issues/41125#issuecomment-1506620040
Do any of you know how certificate rotation is supposed to work for certs that have not yet expired? I have a cluster where all the certs have been rotated except the kube-controller-manager and kube-scheduler ones, and all the nodes have been rebooted, but they are still going to expire in a few days.
c
As discussed above, the two certs in question are not managed by either Rancher or rke2. They are generated by the core Kubernetes components themselves, but Kubernetes lacks logic to renew them.
In the future Rancher should handle this, since it is responsible for configuring the components to generate certs; it's not something that rke2 does by default.