tall-translator-73410
04/11/2023, 8:37 AM
• restarting rancher-system-agent => no effect
• restarting nodes => no effect
• upgrading to a more recent version of RKE2 even though the previous one was not fully deployed => no effect (still one up-to-date node with its probes down)
• upgrading to a more recent version of Rancher (2.6.9 => 2.7.1) => no effect
The clusters are healthy from the k8s point of view: the etcd cluster is healthy with all members in sync, and scheduling and controllers are working correctly.
We are not sure if it is the root cause, but we found some articles about the switch from insecure to secure ports for the controller-manager and scheduler in recent k8s versions; could the check be hitting the wrong port?
Does anyone know if it is really rancher-system-agent that is in charge of probing the scheduler and controller-manager? How can I check the probes' configuration?
Note: this only affects RKE2 clusters managed by Rancher; for clusters deployed manually with RKE2 and then imported into Rancher, upgrades work without any issue.
creamy-pencil-82913
04/11/2023, 9:29 AM
tall-translator-73410
04/11/2023, 9:33 AM
jolly-processor-88759
04/11/2023, 1:03 PM
Can you check crictl ps and the rancher-system-agent logs to see if there are any errors?
tall-translator-73410
04/11/2023, 1:35 PM
# journalctl -u rancher-system-agent -n 2000 |grep -i error
Apr 10 13:11:35 <nodename> rancher-system-agent[50346]: time="2023-04-10T13:11:35Z" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (369730905 vs 369730954)"
Apr 10 13:11:35 <nodename> rancher-system-agent[50346]: time="2023-04-10T13:11:35Z" level=error msg="error syncing 'fleet-default/custom-6ab162c666c9-machine-plan': handler secret-watch: secret received was too old, requeuing"
With crictl, I can tell that both kube-controller-manager-<node> and kube-scheduler-<node> are Running.
In the kubelet log there are several errors talking about failed to sync secret cache: timed out waiting for the condition, or like this one:
E0410 22:34:51.109948 1396 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 10s restarting failed container=kube-controller-manager pod=kube-controller-manager-<node-name>_kube-system(57585a0305e4e46df816ebab263926f3)\"" pod="kube-system/kube-controller-manager-<node-name>" podUID=57585a0305e4e46df816ebab263926f3
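One way to see why the container keeps crash-looping (a sketch, not from the thread; the container ID is a placeholder) is to list every kube-controller-manager container, including exited ones, and read the logs of the most recent failed instance:
# list all kube-controller-manager containers, including exited ones
crictl ps -a --name kube-controller-manager
# dump the last lines of the chosen container's log (replace <container-id> with an ID from the listing above)
crictl logs --tail 50 <container-id>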
jolly-processor-88759
04/11/2023, 1:36 PM
tall-translator-73410
04/11/2023, 1:41 PM
jolly-processor-88759
04/11/2023, 1:42 PM
tall-translator-73410
04/11/2023, 1:43 PM
# crictl logs 74fa73b170537 2>&1 |grep -i error
I0411 06:36:59.217837 1 event.go:294] "Event occurred" object="<namespace>/cm-acme-http-solver-v2cq5" fieldPath="" kind="Endpoints" apiVersion="v1" type="Warning" reason="FailedToUpdateEndpoint" message="Failed to update endpoint <namespace>/cm-acme-http-solver-v2cq5: Operation cannot be fulfilled on endpoints \"cm-acme-http-solver-v2cq5\": StorageError: invalid object, Code: 4, Key: /registry/services/endpoints/<namespace>/cm-acme-http-solver-v2cq5, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 7eeee381-5e72-4a7e-a24e-1089b7d40156, UID in object meta: "
jolly-processor-88759
04/11/2023, 1:45 PM
tall-translator-73410
04/11/2023, 1:45 PM
jolly-processor-88759
04/11/2023, 1:45 PM
tall-translator-73410
04/11/2023, 1:48 PM
jolly-processor-88759
04/11/2023, 1:52 PM
tall-translator-73410
04/11/2023, 1:53 PM
# journalctl -u rke2-server -n 4000 |grep -i error |cut -c 27- |sed -e 's/2023-[^Z]*Z/TIMEREDACTED"/' |sort |uniq -c|sort -n |grep -v "^ 1"
3 rke2[877]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
4 rke2[366270]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
4 rke2[367289]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
4 rke2[877]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
5 rke2[877]: time="TIMEREDACTED"" level=warning msg="Proxy error: write failed: io: read/write on closed pipe"
9 rke2[387056]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
11 rke2[387056]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
15 rke2[912]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
162 rke2[912]: time="TIMEREDACTED"" level=warning msg="Proxy error: write failed: io: read/write on closed pipe"
In namespace fleet-default, the secret custom-<nodeid>-machine-plan contains these keys:
-> applied-checksum
-> appliedPlan
-> failed-checksum
-> failed-output
-> failure-count
-> last-apply-time
-> plan
-> probe-statuses
-> success-count
-> applied-output
-> applied-periodic-output
-> failure-threshold
-> max-failures
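For reference, those keys can be inspected directly with kubectl (a minimal sketch, assuming access to the Rancher management cluster and the secret name shown above):
# decode the probe-statuses key from the machine-plan secret
kubectl get secret -n fleet-default custom-<nodeid>-machine-plan \
  -o jsonpath='{.data.probe-statuses}' | base64 -d | jq .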
jolly-processor-88759
04/11/2023, 2:04 PM
tall-translator-73410
04/11/2023, 2:04 PM
{
"calico": {
"healthy": true,
"successCount": 1
},
"etcd": {
"healthy": true,
"successCount": 1
},
"kube-apiserver": {
"healthy": true,
"successCount": 1
},
"kube-controller-manager": {
"failureCount": 2
},
"kube-scheduler": {
"failureCount": 2
},
"kubelet": {
"healthy": true,
"successCount": 1
}
}
(this is the content of the probe-statuses key)
jolly-processor-88759
04/11/2023, 2:05 PM
What about plan and appliedPlan?
tall-translator-73410
04/11/2023, 2:11 PM
The plan key and appliedPlan are strictly the same data.
# k view-secret -n fleet-default custom-<nodeid>-machine-plan plan |jq '.probes."kube-controller-manager"'
{
"initialDelaySeconds": 1,
"timeoutSeconds": 5,
"successThreshold": 1,
"failureThreshold": 2,
"httpGet": {
"url": "<https://127.0.0.1:10257/healthz>",
"caCert": "/var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt"
}
}
When calling the URL with curl, ignoring the certificate, kube-controller-manager answers "ok":
# curl -k https://127.0.0.1:10257/healthz
ok
And when using the expected CA cert:
# curl --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt https://127.0.0.1:10257/healthz
curl: (60) SSL certificate problem: certificate has expired
More details here: <https://curl.haxx.se/docs/sslcerts.html>
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
So it seems to be a certificate problem.
jolly-processor-88759
04/11/2023, 2:30 PM
openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
tall-translator-73410
04/11/2023, 2:33 PM
# openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
Validity
Not Before: Dec 29 13:14:28 2021 GMT
Not After : Dec 29 13:14:28 2022 GMT
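Since only some probes fail, it can help to check every certificate under the RKE2 server TLS directory at once (a quick sketch, assuming the default data dir):
# print the expiry date of every RKE2 server certificate
find /var/lib/rancher/rke2/server/tls -name '*.crt' \
  -exec sh -c 'echo "$1: $(openssl x509 -noout -enddate -in "$1")"' _ {} \;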
jolly-processor-88759
04/11/2023, 2:35 PM
tall-translator-73410
04/11/2023, 2:36 PM
jolly-processor-88759
04/11/2023, 2:37 PM
tall-translator-73410
04/11/2023, 2:39 PM
jolly-processor-88759
04/11/2023, 2:40 PM
tall-translator-73410
04/11/2023, 2:40 PM
jolly-processor-88759
04/11/2023, 2:41 PM
tall-translator-73410
04/11/2023, 2:42 PM
jolly-processor-88759
04/11/2023, 2:42 PM
tall-translator-73410
04/11/2023, 2:42 PM
jolly-processor-88759
04/11/2023, 2:43 PM
tall-translator-73410
04/11/2023, 2:43 PM
jolly-processor-88759
04/11/2023, 2:43 PM
tall-translator-73410
04/11/2023, 2:43 PM
jolly-processor-88759
04/11/2023, 2:44 PM
tall-translator-73410
04/11/2023, 2:44 PM
So it seems the way to fix the "Waiting for probes: kube-controller-manager, kube-scheduler" issue is to force a certificate rotation of these services, but as the upgrade is stuck, clicking "Rotate certificates" in the Rancher UI does nothing.
Does anyone know how to trigger cert rotation in that case?
creamy-pencil-82913
04/11/2023, 4:43 PM
tall-translator-73410
04/11/2023, 4:45 PM
creamy-pencil-82913
04/11/2023, 4:46 PM
tall-translator-73410
04/11/2023, 4:50 PM
creamy-pencil-82913
04/12/2023, 5:05 AM
tall-translator-73410
04/12/2023, 8:27 AM
adventurous-magazine-13224
04/12/2023, 12:40 PM
sudo openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
Validity
Not Before: Apr 12 09:18:14 2022 GMT
Not After : Apr 12 09:18:14 2023 GMT
tall-translator-73410
04/12/2023, 12:42 PM
Waiting for probes: kube-controller-manager, kube-scheduler
problem
busy-flag-55906
04/12/2023, 12:44 PM
tall-translator-73410
04/12/2023, 12:45 PM
busy-flag-55906
04/12/2023, 12:49 PM
adventurous-magazine-13224
04/12/2023, 12:51 PM
creamy-pencil-82913
04/12/2023, 3:49 PM
tall-translator-73410
04/12/2023, 3:59 PM
creamy-pencil-82913
04/12/2023, 9:01 PM
wide-receptionist-90874
04/12/2023, 9:01 PM
pkill kube-controller-manager and pkill kube-scheduler maybe?
creamy-pencil-82913
04/12/2023, 9:01 PM
wide-receptionist-90874
04/12/2023, 9:02 PM
tall-translator-73410
04/13/2023, 9:01 AM
Here are the commands I used to check the probes manually:
(
curl --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt \
https://127.0.0.1:10257/healthz >/dev/null 2>&1 \
&& echo "[OK] Kube Controller probe" \
|| echo "[FAIL] Kube Controller probe";
curl --cacert /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
https://127.0.0.1:10259/healthz >/dev/null 2>&1 \
&& echo "[OK] Scheduler probe" \
|| echo "[FAIL] Scheduler probe";
)
And below are the commands I used to force a certificate rotation for the failed probes:
echo "Rotating kube-controller-manager certificate"
rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
crictl rm -f $(crictl ps -q --name kube-controller-manager)
echo "Rotating kube-scheduler certificate"
rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}
crictl rm -f $(crictl ps -q --name kube-scheduler)
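A follow-up check (not from the thread, just a verification sketch) is to confirm that rke2 re-issued the files with a new validity window and that the probe endpoint now verifies against them:
# confirm the regenerated certificate dates
openssl x509 -noout -dates -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt
# re-run the probe the same way the plan defines it
curl --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt https://127.0.0.1:10257/healthz
Recent RKE2 releases also ship a rke2 certificate rotate subcommand (run with rke2-server stopped) that can rotate these certificates without deleting files by hand, which may be the cleaner route where available.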
creamy-wolf-46823
04/13/2023, 9:42 AM
tall-translator-73410
04/13/2023, 10:29 AM
jolly-processor-88759
04/13/2023, 1:12 PM
tall-translator-73410
04/13/2023, 1:14 PM
adventurous-magazine-13224
04/13/2023, 1:15 PM
busy-flag-55906
04/13/2023, 1:21 PM
jolly-processor-88759
04/13/2023, 1:53 PM
wide-receptionist-90874
04/13/2023, 4:24 PM
best-microphone-20624
04/13/2023, 5:03 PM
wide-receptionist-90874
04/13/2023, 7:43 PM
jolly-processor-88759
04/14/2023, 12:16 PM
The /var/lib/rancher/rke2/server/tls/kube-controller-manager and kube-scheduler folders do not exist on a non-downstream-provisioned RKE2 cluster? I can see them on all of our downstream-provisioned systems but not on our MCM cluster.
creamy-pencil-82913
04/15/2023, 12:09 AM