# general
b
This all looks like it traces back to an expired set of certificates on the servers. That has been reconciled with the procedure mentioned here: https://github.com/rancher/rancher/issues/41125
So my question, generally: how do I fix certificate rotation on a worker node, and how do I resolve the stuck probes?
Workers are running v1.28.15+rke2r1, Rancher 2.10.1.
I have a sneaking suspicion it's stuck in an upgrade, but I'm not sure how to check or fix that.
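A quick way to sanity-check both theories from the shell is to inspect certificate expiry on each node and compare kubelet versions across the cluster. A minimal sketch, assuming a default RKE2 install (the paths are the standard RKE2 TLS directories):
```
# Check expiry on the server-side certs (run on each server node);
# workers keep theirs under /var/lib/rancher/rke2/agent instead
for f in /var/lib/rancher/rke2/server/tls/*.crt; do
  echo "$f: $(openssl x509 -enddate -noout -in "$f")"
done

# A half-finished upgrade usually shows up as mixed kubelet versions
kubectl get nodes -o wide
```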
m
Check the journal logs for rke2-server on the host.
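For anyone following along, a sketch of the usual places to look (rke2-server runs on control-plane/etcd nodes, rke2-agent on workers; the kubelet log path assumes a default RKE2 layout):
```
# systemd journal for the RKE2 services
journalctl -u rke2-server --no-pager --since "1 hour ago"   # control-plane/etcd nodes
journalctl -u rke2-agent  --no-pager --since "1 hour ago"   # worker nodes

# kubelet keeps its own log file on RKE2
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log
```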
b
That's the thing: nothing terribly enlightening in that or the rke2-agent logs.
m
On worker nodes you normally don't need to do anything, only on control-plane/etcd nodes. If you are seeing connectivity issues on worker nodes, I'd try restarting the rke2-agent systemd unit on each worker node. For extra safety, drain/cordon each node, then do the agent restart, then uncordon the node.
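That sequence, sketched out; `worker-1` is a placeholder node name, and the drain flags are the common general-purpose ones rather than anything RKE2-specific:
```
NODE=worker-1   # hypothetical node name

# From a machine with cluster access
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# On the worker itself
systemctl restart rke2-agent

# Once the node reports Ready again
kubectl uncordon "$NODE"
```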
b
The certificate-update issue seems to be the least of my problems right now (workloads run fine). The nagging issue is that all my control plane nodes are stuck in a reconciling state with either "waiting for kubelet to update" or "waiting for probes".
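Those messages come from Rancher's provisioning layer, so it can help to inspect the provisioning objects in the Rancher local cluster to see exactly which machine is stuck. A sketch, assuming the downstream cluster lives in the usual fleet-default namespace (the machine name is a placeholder):
```
# Run against the Rancher local cluster
kubectl get clusters.provisioning.cattle.io -n fleet-default
kubectl get machines.cluster.x-k8s.io -n fleet-default

# The "waiting for probes" / "waiting for kubelet to update" text
# surfaces in the machine's status conditions
kubectl describe machines.cluster.x-k8s.io -n fleet-default <machine-name>
```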
m
And just to confirm, you did these steps?
```
# Delete the stale certs and kill the pods so they come back
# with fresh ones (per rancher/rancher#41125)
rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
crictl rm -f $(crictl ps -q --name kube-controller-manager)

rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}
crictl rm -f $(crictl ps -q --name kube-scheduler)
```
If you did ^this, then what I usually do is move the pod manifests for the scheduler and controller into a tmp location, let the pods terminate, and then add them back. I'd do it one control plane node at a time.
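A minimal sketch of that manifest shuffle, assuming RKE2's default static-pod directory; the kubelet removes each pod when its manifest disappears and recreates it when the manifest comes back:
```
MANIFESTS=/var/lib/rancher/rke2/agent/pod-manifests

# Pull the manifests so the kubelet tears the pods down
mv "$MANIFESTS/kube-scheduler.yaml" "$MANIFESTS/kube-controller-manager.yaml" /tmp/

# Wait until the containers are gone
crictl ps | grep -E 'kube-(scheduler|controller-manager)' || echo "pods terminated"

# Restore the manifests; the pods come back with the regenerated certs
mv /tmp/kube-scheduler.yaml /tmp/kube-controller-manager.yaml "$MANIFESTS/"
```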
b
Yeah, I've done that and confirmed:
```
[root@control0 ~]# (
> curl --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt \
>  https://127.0.0.1:10257/healthz >/dev/null 2>&1 \
>  && echo "[OK] Kube Controller probe" \
>  || echo "[FAIL] Kube Controller probe";
>
> curl --cacert /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
>  https://127.0.0.1:10259/healthz >/dev/null 2>&1 \
>  && echo "[OK] Scheduler probe" \
>  || echo "[FAIL] Scheduler probe";
> )
[OK] Kube Controller probe
[OK] Scheduler probe
[root@control0 ~]#
```
It was bad before, to be clear, but I've done that procedure and now the checks are all OK on the control plane nodes
(regardless of what the Rancher UI says)
m
^ Good move deleting the domain. Not sure what else to try; I've had that issue before, but clearing the old certs and forcing a restart of those pods usually cleared it. Hope someone else has some insight.
b
thanks for the thoughts, at least
c
I'm facing the same issue on newer versions of Rancher and RKE2: https://github.com/rancher/rancher/issues/49757. Could you describe the steps mentioned above, "deleting the domain"?