# general
b
This all looks like it traces back to an expired set of certificates on the servers. That has been reconciled with the procedure mentioned here: https://github.com/rancher/rancher/issues/41125
So my question, generally: how do I fix certificate rotation on a worker node, and how do I resolve the stuck probes?
Workers are running v1.28.15+rke2r1, Rancher 2.10.1.
I have a sneaking suspicion it's stuck in an upgrade, but I'm not sure how to check or fix that.
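A quick way to sanity-check both theories from the shell is to inspect certificate expiry on each node and compare kubelet versions across the cluster. A minimal sketch, assuming a default RKE2 install (the paths are the standard RKE2 TLS directories):
```
# Check expiry on the server-side certs (run on each server node);
# workers keep theirs under /var/lib/rancher/rke2/agent instead
for f in /var/lib/rancher/rke2/server/tls/*.crt; do
  echo "$f: $(openssl x509 -enddate -noout -in "$f")"
done

# A half-finished upgrade usually shows up as mixed kubelet versions
kubectl get nodes -o wide
```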
m
Check the journal logs for rke2-server on the host.
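For anyone following along, a sketch of the usual places to look (rke2-server runs on control-plane/etcd nodes, rke2-agent on workers; the kubelet log path assumes a default RKE2 layout):
```
# systemd journal for the RKE2 services
journalctl -u rke2-server --no-pager --since "1 hour ago"   # control-plane/etcd nodes
journalctl -u rke2-agent  --no-pager --since "1 hour ago"   # worker nodes

# kubelet keeps its own log file on RKE2
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log
```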
b
That's the thing: nothing terribly enlightening in that or the rke2-agent logs.
m
On worker nodes you normally don't need to do anything, only on control-plane/etcd nodes. If you are seeing connectivity issues on worker nodes, I'd try restarting the rke2-agent systemd unit on each worker node. For extra safety, drain/cordon each node, then do the agent restart, then uncordon the node.
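That sequence, sketched out; `worker-1` is a placeholder node name, and the drain flags are the common general-purpose ones rather than anything RKE2-specific:
```
NODE=worker-1   # hypothetical node name

# From a machine with cluster access
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# On the worker itself
systemctl restart rke2-agent

# Once the node reports Ready again
kubectl uncordon "$NODE"
```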
b
The certificate-update issue seems to be the least of my problems right now (workloads run fine). The nagging issue is that all my control plane nodes are stuck in a reconciling state with either "waiting for kubelet to update" or "waiting for probes".
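Those messages come from Rancher's provisioning layer, so it can help to inspect the provisioning objects in the Rancher local cluster to see exactly which machine is stuck. A sketch, assuming the downstream cluster lives in the usual fleet-default namespace (the machine name is a placeholder):
```
# Run against the Rancher local cluster
kubectl get clusters.provisioning.cattle.io -n fleet-default
kubectl get machines.cluster.x-k8s.io -n fleet-default

# The "waiting for probes" / "waiting for kubelet to update" text
# surfaces in the machine's status conditions
kubectl describe machines.cluster.x-k8s.io -n fleet-default <machine-name>
```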
m
And just to confirm, you did these steps?
```
# Delete the stale certs and kill the pods so they come back
# with fresh ones (per rancher/rancher#41125)
rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
crictl rm -f $(crictl ps -q --name kube-controller-manager)

rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}
crictl rm -f $(crictl ps -q --name kube-scheduler)
```
If you did ^this, then what I usually do is move the pod manifests for the scheduler and controller into a tmp location, let the pods terminate, and then add them back. I'd do it one control plane node at a time.
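A minimal sketch of that manifest shuffle, assuming RKE2's default static-pod directory; the kubelet removes each pod when its manifest disappears and recreates it when the manifest comes back:
```
MANIFESTS=/var/lib/rancher/rke2/agent/pod-manifests

# Pull the manifests so the kubelet tears the pods down
mv "$MANIFESTS/kube-scheduler.yaml" "$MANIFESTS/kube-controller-manager.yaml" /tmp/

# Wait until the containers are gone
crictl ps | grep -E 'kube-(scheduler|controller-manager)' || echo "pods terminated"

# Restore the manifests; the pods come back with the regenerated certs
mv /tmp/kube-scheduler.yaml /tmp/kube-controller-manager.yaml "$MANIFESTS/"
```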
b
Yeah, I've done that and confirmed:
```
[root@control0 ~]# (
> curl --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt \
>  https://127.0.0.1:10257/healthz >/dev/null 2>&1 \
>  && echo "[OK] Kube Controller probe" \
>  || echo "[FAIL] Kube Controller probe";
>
> curl --cacert /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
>  https://127.0.0.1:10259/healthz >/dev/null 2>&1 \
>  && echo "[OK] Scheduler probe" \
>  || echo "[FAIL] Scheduler probe";
> )
[OK] Kube Controller probe
[OK] Scheduler probe
[root@control0 ~]#
```
It was bad before, to be clear, but I've done that procedure and now the checks are all OK on the control plane nodes
(regardless of what the Rancher UI says)
m
^ Good move deleting the domain. Not sure what else to try; I've had that issue before, but clearing the old certs and forcing a restart of those pods usually cleared it. Hope someone else has some insight.
b
thanks for the thoughts, at least
c
I'm facing the same issue on newer versions of Rancher and RKE2: https://github.com/rancher/rancher/issues/49757. Could you describe the steps mentioned above, "deleting the domain"?