t

tall-translator-73410

04/11/2023, 8:37 AM
Hi everyone! We are facing the very same problem as @busy-flag-55906 on all our clusters: upgrades are frozen with one or multiple controlplane node(s) stuck in "Waiting for probes: kube-controller-manager, kube-scheduler". Here are the versions of the clusters having the problem:
• Rancher 2.7.1, RKE2 managed cluster upgraded from v1.24.7+rke2r1 to v1.24.9+rke2r2 (1/3 controlplanes affected, 1/3 waiting for plan)
• Rancher 2.6.9, RKE2 managed cluster v1.24.7+rke2r1 (3/3 controlplanes affected)
• Rancher 2.7.1, RKE2 managed cluster v1.24.8+rke2r1 (1/3 controlplanes affected)
We also have one cluster not having the problem:
• Rancher 2.6.9, RKE2 managed cluster v1.24.7+rke2r1
These clusters are hosted with different hosting providers; they are all based on Ubuntu 20.04 or 22.04. We already tried to:
• restart rancher-system-agent => no effect
• restart the nodes => no effect
• upgrade to a more recent version of RKE2 even if the previous one was not fully deployed => no effect (still one node up-to-date with probes down)
• upgrade to a more recent version of Rancher (2.6.9 => 2.7.1) => no effect
The clusters are healthy from the k8s point of view: the etcd cluster is healthy with all members in sync, and scheduling and controllers are working correctly. We are not sure if it's the root cause, but we found some articles about a change from the insecure to the secure port for the controller and scheduler in recent versions of k8s; could this be a problem of the wrong port being used in the check? Does anyone know if it's really rancher-system-agent that is in charge of probing the scheduler and controller? How can we check the probes' config?
Note: this only affects RKE2 clusters managed by Rancher; for clusters deployed manually with RKE2 and then imported into Rancher, upgrades work without any issue.
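For reference, the rancher-system-agent restart mentioned above is just the usual systemd service restart; a minimal sketch, assuming the default unit name (the same one used with journalctl later in this thread):

# restart and inspect the Rancher system agent on an affected node
systemctl restart rancher-system-agent
systemctl status rancher-system-agent --no-pager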
c

creamy-pencil-82913

04/11/2023, 9:29 AM
Have you opened a support case or GitHub issue? The first being preferred if you have paid support...
t

tall-translator-73410

04/11/2023, 9:33 AM
We have no paid support; I can open a GitHub issue.
I thought it was better to ask the Slack community before raising an issue, am I wrong?
Issue opened: https://github.com/rancher/rancher/issues/41125 @busy-flag-55906, @fancy-oil-5019, @adorable-midnight-46384, @numerous-soccer-99009, @glamorous-lighter-5580, it seems that you are facing the very same problem; would you mind participating in this issue by adding your information? 🙂 Thanks.
j

jolly-processor-88759

04/11/2023, 1:03 PM
@tall-translator-73410 did you look at the kubelet logs on the hosts? What pods are not running (crictl ps)? And check the rancher-system-agent logs to see if there are any errors?
t

tall-translator-73410

04/11/2023, 1:35 PM
Here is the rancher-system-agent log filtered on errors:
# journalctl -u rancher-system-agent -n 2000 |grep -i error
Apr 10 13:11:35 <nodename> rancher-system-agent[50346]: time="2023-04-10T13:11:35Z" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (369730905 vs 369730954)"
Apr 10 13:11:35 <nodename> rancher-system-agent[50346]: time="2023-04-10T13:11:35Z" level=error msg="error syncing 'fleet-default/custom-6ab162c666c9-machine-plan': handler secret-watch: secret received was too old, requeuing"
With crictl, I can tell that both kube-controller-manager-<node> and kube-scheduler-<node> are Running. In the kubelet log there are several errors talking about "failed to sync secret cache: timed out waiting for the condition", or like this one:
E0410 22:34:51.109948    1396 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 10s restarting failed container=kube-controller-manager pod=kube-controller-manager-<node-name>_kube-system(57585a0305e4e46df816ebab263926f3)\"" pod="kube-system/kube-controller-manager-<node-name>" podUID=57585a0305e4e46df816ebab263926f3
j

jolly-processor-88759

04/11/2023, 1:36 PM
Go check the kube-controller-manager logs next. Also, since rancher-system-agent is saying the secret is too old, restart rancher-system-agent.
Also add these logs to the GitHub issue.
It sounds like you may have a config.yaml error that you are passing down from Rancher. Did you have a feature flag on or kube-controller args that are now deprecated?
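For that kind of check, the Rancher-rendered RKE2 config on a downstream node lives in the drop-in file mentioned later in this thread; a minimal sketch, assuming the default paths:

# main RKE2 config plus the Rancher-managed drop-in (the base config.yaml may not exist on Rancher-provisioned nodes)
cat /etc/rancher/rke2/config.yaml 2>/dev/null
cat /etc/rancher/rke2/config.yaml.d/50-rancher.yaml
# look for controller-manager/scheduler extra args that might be deprecated
grep -iE 'kube-controller-manager|kube-scheduler' /etc/rancher/rke2/config.yaml.d/*.yaml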
t

tall-translator-73410

04/11/2023, 1:41 PM
There is no error in the kube-controller-manager container logs.
I double-checked the cluster config, but I'm quite sure that we used default parameters (we just removed the ingress controller from the main screen).
Note that these are all clusters created hundreds of days ago.
j

jolly-processor-88759

04/11/2023, 1:42 PM
you might want to grep kubelet.log for "error"
but this really does look like a rancher-system-agent issue
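On RKE2 the kubelet writes to a flat log file; a quick sketch, assuming the default RKE2 log location:

# grep the RKE2 kubelet log for recent errors
grep -i error /var/lib/rancher/rke2/agent/logs/kubelet.log | tail -n 50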
t

tall-translator-73410

04/11/2023, 1:43 PM
Yes, it is empty; the only error is about a cert-manager solver that is okay to fail:
# crictl logs 74fa73b170537 2>&1 |grep -i error
I0411 06:36:59.217837       1 event.go:294] "Event occurred" object="<namespace>/cm-acme-http-solver-v2cq5" fieldPath="" kind="Endpoints" apiVersion="v1" type="Warning" reason="FailedToUpdateEndpoint" message="Failed to update endpoint <namespace>/cm-acme-http-solver-v2cq5: Operation cannot be fulfilled on endpoints \"cm-acme-http-solver-v2cq5\": StorageError: invalid object, Code: 4, Key: /registry/services/endpoints/<namespace>/cm-acme-http-solver-v2cq5, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 7eeee381-5e72-4a7e-a24e-1089b7d40156, UID in object meta: "
j

jolly-processor-88759

04/11/2023, 1:45 PM
check rke2-server's logs
t

tall-translator-73410

04/11/2023, 1:45 PM
I think it's just a wrong probe config problem. Both the controller and scheduler are behaving correctly in the cluster. Do you know where rancher-system-agent takes its config from?
j

jolly-processor-88759

04/11/2023, 1:45 PM
That's not the best ticket haha
t

tall-translator-73410

04/11/2023, 1:48 PM
There are many more errors in the rke2-server logs
j

jolly-processor-88759

04/11/2023, 1:52 PM
You can check the Rancher logs next.
rancher-system-agent is just the client side of the downstream provisioner in Rancher, so it's likely translating the machine config and instructions from the API and calling it a secret, but I haven't nailed it down yet.
t

tall-translator-73410

04/11/2023, 1:53 PM
Here are the rke2-server log errors that occur more than once:
# journalctl -u rke2-server -n 4000 |grep -i error |cut -c 27- |sed -e 's/2023-[^Z]*Z/TIMEREDACTED"/' |sort |uniq -c|sort -n |grep -v "^      1"
      3 rke2[877]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
      4 rke2[366270]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
      4 rke2[367289]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
      4 rke2[877]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
      5 rke2[877]: time="TIMEREDACTED"" level=warning msg="Proxy error: write failed: io: read/write on closed pipe"
      9 rke2[387056]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
     11 rke2[387056]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
     15 rke2[912]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
    162 rke2[912]: time="TIMEREDACTED"" level=warning msg="Proxy error: write failed: io: read/write on closed pipe"
I saw that rancher-system-agent was referring to a secret in the Rancher cluster: namespace fleet-default, secret custom-<nodeid>-machine-plan. But this secret has several keys, and I don't really know which one to look at to confirm the secret's "outdated" state (a way to decode one of them is sketched after the list):
-> applied-checksum
-> appliedPlan
-> failed-checksum
-> failed-output
-> failure-count
-> last-apply-time
-> plan
-> probe-statuses
-> success-count
-> applied-output
-> applied-periodic-output
-> failure-threshold
-> max-failures
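For example, with kubectl access to the Rancher management cluster, those keys can be listed and decoded like this (the secret name placeholder is from above; jq is assumed to be available):

# list the keys present in the machine-plan secret
kubectl -n fleet-default get secret custom-<nodeid>-machine-plan -o json | jq -r '.data | keys[]'
# decode the probe-statuses key (the same pattern works for plan / appliedPlan)
kubectl -n fleet-default get secret custom-<nodeid>-machine-plan -o json | jq -r '.data."probe-statuses"' | base64 -d | jq .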
j

jolly-processor-88759

04/11/2023, 2:04 PM
So in that secret there is a probe-statuses key at the bottom; I imagine this is where it shows kube-scheduler/kube-controller-manager being not ready?
t

tall-translator-73410

04/11/2023, 2:04 PM
yes !
{
  "calico": {
    "healthy": true,
    "successCount": 1
  },
  "etcd": {
    "healthy": true,
    "successCount": 1
  },
  "kube-apiserver": {
    "healthy": true,
    "successCount": 1
  },
  "kube-controller-manager": {
    "failureCount": 2
  },
  "kube-scheduler": {
    "failureCount": 2
  },
  "kubelet": {
    "healthy": true,
    "successCount": 1
  }
}
This is the content of the secret key probe-statuses.
j

jolly-processor-88759

04/11/2023, 2:05 PM
So in that plan JSON you can see the current cloud-init and config.yaml files being applied.
The config.yaml file for downstream provisioned clusters will be the one with the path "/etc/rancher/rke2/config.yaml.d/50-rancher.yaml".
You can also see the last applied plan and compare the two, maybe, if they aren't already updated to the same: plan and appliedPlan.
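A quick way to compare those two keys (again assuming kubectl access to the Rancher management cluster and jq; the secret name placeholder is from above):

# decode both keys and diff their normalized JSON
kubectl -n fleet-default get secret custom-<nodeid>-machine-plan -o json | jq -r '.data.plan' | base64 -d | jq -S . > /tmp/plan.json
kubectl -n fleet-default get secret custom-<nodeid>-machine-plan -o json | jq -r '.data.appliedPlan' | base64 -d | jq -S . > /tmp/appliedPlan.json
diff /tmp/plan.json /tmp/appliedPlan.json && echo "plan and appliedPlan are identical"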
t

tall-translator-73410

04/11/2023, 2:11 PM
I'm going to check that (just fixing a prod issue, not related to this 🙂 )
👍 1
The plan key and appliedPlan contain strictly the same data.
Very interesting: the key "plan" contains a JSON with two main keys, files and probes. I'll try to manually check the kube-controller-manager probe:
# k view-secret -n fleet-default custom-<nodeid>-machine-plan plan |jq '.probes."kube-controller-manager"'
{
  "initialDelaySeconds": 1,
  "timeoutSeconds": 5,
  "successThreshold": 1,
  "failureThreshold": 2,
  "httpGet": {
    "url": "<https://127.0.0.1:10257/healthz>",
    "caCert": "/var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt"
  }
}
By calling the URL with curl and ignoring the cert, kube-controller-manager answers "ok":
# curl -k <https://127.0.0.1:10257/healthz>
ok
And when using the expected CA cert:
# curl --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt <https://127.0.0.1:10257/healthz>
curl: (60) SSL certificate problem: certificate has expired
More details here: <https://curl.haxx.se/docs/sslcerts.html>

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
So it seems to be a certificate problem
j

jolly-processor-88759

04/11/2023, 2:30 PM
Usually that shows up in the logs we checked before. Hrm, and you said only 100 days? I think the default is 365 days on the certs.
openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
that should get you the dates of the cert
t

tall-translator-73410

04/11/2023, 2:33 PM
I'm checking it right now (I was cross-posting results in the issue)
It has been expired for a looooooooong time (4 months) 😄
# openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
        Validity
            Not Before: Dec 29 13:14:28 2021 GMT
            Not After : Dec 29 13:14:28 2022 GMT
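To check all the control-plane certs at once, a hypothetical sweep over the RKE2 tls directory (paths assumed to match the ones above) could look like:

# print the expiry date of every cert under the RKE2 server tls directory
for crt in $(find /var/lib/rancher/rke2/server/tls -name '*.crt'); do
  echo "== $crt"
  openssl x509 -noout -enddate -in "$crt"
done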
💯 1
So maybe try to renew the certificates from the Rancher UI?
j

jolly-processor-88759

04/11/2023, 2:35 PM
your guess is as good as mine 😄
not sure if it will do it while the cluster is in provisioning like this
@creamy-pencil-82913 Can you advise?
t

tall-translator-73410

04/11/2023, 2:36 PM
(First of all, thanks a lot for your help; even if this isn't the real root cause, I don't want to wait for the resolution to thank you 🙂)
j

jolly-processor-88759

04/11/2023, 2:37 PM
You're welcome. Hopefully Brad can give you a quick answer. You might throw that question into #general, on how to fix the certs, to see if anyone else has experience.
It's always bothered me that RKE2 doesn't self-manage the certs.
t

tall-translator-73410

04/11/2023, 2:39 PM
Yes, as fleet seems to be stuck, I'm not sure that clicking on "rotate certificates" will actually do the job... but maybe; let's wait for the Rancher team's feedback.
We already had cert issues in the past on RKE1, but I was quite sure that RKE2 was rotating certs at each upgrade; maybe I was wrong...
j

jolly-processor-88759

04/11/2023, 2:40 PM
If you find out 😄 let me know so I can put in a maintenance plan for our clusters hahaha
t

tall-translator-73410

04/11/2023, 2:40 PM
haha 😄
j

jolly-processor-88759

04/11/2023, 2:41 PM
I just checked ours and they have already rolled once on our oldest cluster. Maybe it does it during RKE2 upgrades? Not sure when you last upgraded.
t

tall-translator-73410

04/11/2023, 2:42 PM
maybe only during major upgrades ?
j

jolly-processor-88759

04/11/2023, 2:42 PM
1.22 => 1.23 (Minor) maybe
t

tall-translator-73410

04/11/2023, 2:42 PM
yeah ... my bad ... when I say "major" in k8s it's minor in fact 😄
j

jolly-processor-88759

04/11/2023, 2:43 PM
yep, k8s (1) will never change hahaha
saw a meme about that the other day
t

tall-translator-73410

04/11/2023, 2:43 PM
I quite consider that "1.24" is the major number 😄 and ".11" the minor ^^
j

jolly-processor-88759

04/11/2023, 2:43 PM
.11 is the patch
t

tall-translator-73410

04/11/2023, 2:43 PM
yeah ^^
j

jolly-processor-88759

04/11/2023, 2:44 PM
t

tall-translator-73410

04/11/2023, 2:44 PM
haha ! excellent 😄
So it seems that the fix for the "Waiting for probes: kube-controller-manager, kube-scheduler" issue is to force a certificate rotation of these services, but as the upgrade is stuck, clicking on "Rotate certificates" in the Rancher UI does nothing. Does anyone know how to trigger a cert rotation in that case?
I'll try to cross-post to #general; this thread is so long that no one will ever read all of this.
🙌 1
c

creamy-pencil-82913

04/11/2023, 4:43 PM
This sounds like an issue with the plan controller on the rancher side that causes it to constantly thrash the plan under certain conditions, which in turn causes rancher-system-agent to constantly restart rke2. This should be fixed in 2.7.2 which will release quite soon.
t

tall-translator-73410

04/11/2023, 4:45 PM
That sounds great! I saw that rc10 just came out earlier; do you have any hint about the release date?
c

creamy-pencil-82913

04/11/2023, 4:46 PM
soon is the most I can say.
t

tall-translator-73410

04/11/2023, 4:50 PM
Ok thanks !
c

creamy-pencil-82913

04/12/2023, 5:05 AM
it’s out
t

tall-translator-73410

04/12/2023, 8:27 AM
Hi @creamy-pencil-82913! thanks 🙂
I just upgraded one of our Rancher instances to 2.7.2. Is there anything to do to fix the cluster state, or should I just wait until it eventually auto-fixes the problem?
a

adventurous-magazine-13224

04/12/2023, 12:40 PM
I upgraded to 2.7.2 today, and thought that caused the same issue... just so happens that the cluster I'm seeing this issue on is 365 days old today! 😛
sudo openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
        Validity
            Not Before: Apr 12 09:18:14 2022 GMT
            Not After : Apr 12 09:18:14 2023 GMT
t

tall-translator-73410

04/12/2023, 12:42 PM
Unlucky! But it does confirm that the problem is due to cert expiration.
It seems that 2.7.2 does not fix the "Waiting for probes: kube-controller-manager, kube-scheduler" problem.
b

busy-flag-55906

04/12/2023, 12:44 PM
Sad news, I was hoping for positive comments here. Indeed my SSL certs have expired as well.
t

tall-translator-73410

04/12/2023, 12:45 PM
Your feedback is important @busy-flag-55906 🙂 It confirms that our problem is due to this cert expiration!
If we find out how to fix it, a lot of people will benefit from it 🙂
I tried to force a certificate rotation from the Rancher UI, but it did nothing, I suppose due to the fact that the cluster is waiting to apply the upgrade plan first.
b

busy-flag-55906

04/12/2023, 12:49 PM
yes, I tried this as well with no luck
a

adventurous-magazine-13224

04/12/2023, 12:51 PM
I've just tried this too, but still no good 😞
c

creamy-pencil-82913

04/12/2023, 3:49 PM
Those certs are created by the controllers themselves, they’re not managed by either rancher or rke2. I’ll have to see how they can be renewed.
t

tall-translator-73410

04/12/2023, 3:59 PM
Thanks @creamy-pencil-82913! It would be awesome to unlock this situation 🙂
c

creamy-pencil-82913

04/12/2023, 9:01 PM
yeah, you can’t delete the shadow pods. You would need to delete the scheduler and controller-manager manifests from /var/lib/rancher/rke2/agent/pod-manifests/ and then restart rke2-server
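A minimal sketch of that manifest-based option, assuming the default RKE2 static pod manifest file names (rke2-server recreates them on restart):

# remove the static pod manifests for the two affected components, then restart rke2-server
rm /var/lib/rancher/rke2/agent/pod-manifests/kube-controller-manager.yaml \
   /var/lib/rancher/rke2/agent/pod-manifests/kube-scheduler.yaml
systemctl restart rke2-server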
w

wide-receptionist-90874

04/12/2023, 9:01 PM
I mean you could probably even pkill kube-controller-manager and pkill kube-scheduler maybe?
c

creamy-pencil-82913

04/12/2023, 9:01 PM
or use crictl to delete them
yeah any of those should work
w

wide-receptionist-90874

04/12/2023, 9:02 PM
I'm going to file an issue in rancher/rancher around this so we can fix this properly... this is no bueno
💯 1
👍 1
t

tall-translator-73410

04/13/2023, 9:01 AM
Deleting the certs manually + forcing a container restart with crictl works!
Here is a shell snippet to check whether the probes are okay or not:
(
curl  --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt \
  <https://127.0.0.1:10257/healthz> >/dev/null 2>&1 \
  && echo "[OK] Kube Controller probe" \
  || echo "[FAIL] Kube Controller probe";

curl --cacert /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
  <https://127.0.0.1:10259/healthz> >/dev/null 2>&1  \
  && echo "[OK] Scheduler probe" \
  || echo "[FAIL] Scheduler probe";
)
And below are the commands I used to force a certificate rotation on the failed probes:
echo "Rotating kube-controller-manager certificate"
rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
crictl rm -f $(crictl ps -q --name kube-controller-manager)

echo "Rotating kube-scheduler certificate"
rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}
crictl rm -f $(crictl ps -q --name kube-scheduler)
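To confirm the rotation worked, re-checking the validity dates afterwards (same openssl pattern as earlier in the thread) should show fresh certs:

# both certs should now show a fresh Not Before / Not After range
openssl x509 -noout -dates -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt
openssl x509 -noout -dates -in /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt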
Thanks a lot @wide-receptionist-90874 and @creamy-pencil-82913! (FYI, I already opened an issue on this problem: https://github.com/rancher/rancher/issues/41125)
c

creamy-wolf-46823

04/13/2023, 9:42 AM
You might consider changing the title of your issue now that the problem is better understood. An alternative might be: "Improve Rancher management of RKE2 clusters to handle expired kube-controller-manager and kube-scheduler certs".
t

tall-translator-73410

04/13/2023, 10:29 AM
If you had the problem, would you be more likely to find this issue with the previous title or with the one you suggest?
To me, your title may be a good one for a new "feature request" issue, not for giving a workaround and unblocking impacted clusters, but maybe I'm wrong.
j

jolly-processor-88759

04/13/2023, 1:12 PM
Great work @tall-translator-73410, glad it all got figured out!
🙇‍♂️ 2
t

tall-translator-73410

04/13/2023, 1:14 PM
Glad you helped 😁 👍
a

adventurous-magazine-13224

04/13/2023, 1:15 PM
Thanks everyone! 😄 This sorted out the problem quickly with a couple of our clusters! 🥳
👍 1
:partyparrot: 1
b

busy-flag-55906

04/13/2023, 1:21 PM
thanks, I was able to fix the issue on the cluster as well
:partyparrot: 1
👍 1
j

jolly-processor-88759

04/13/2023, 1:53 PM
I noticed that a standard RKE2 cluster (non downstream provisioned) appears to have different certificate locations than the downstream clusters. Anyone know if the fix steps work on that as well?
w

wide-receptionist-90874

04/13/2023, 4:24 PM
@tall-translator-73410 I find it helps to add as many relevant logs as possible in the body of the issue, as it gets indexed by Google and people tend to find issues that way. Thank you for filing that issue! I'll get it labeled/assigned/scheduled to fix. Currently I'm actually thinking we might try to implement a fix for this in a manner that allows an RKE2/K3s upgrade to fix it.
💯 3
:partyparrot: 1
b

best-microphone-20624

04/13/2023, 5:03 PM
@tall-translator-73410 The current title for the issue can cover many root causes, including some that were indeed already fixed in 2.7.2. Just wanted to make sure the issue gets visibility with Rancher folks like @wide-receptionist-90874 and @creamy-pencil-82913, which already seems to be the case. Thanks for your troubleshooting efforts and for sharing them with the community!
w

wide-receptionist-90874

04/13/2023, 7:43 PM
@jolly-processor-88759 yes, the action of deleting the certificate files + kicking the component should cause a new cert to be generated.
j

jolly-processor-88759

04/14/2023, 12:16 PM
@wide-receptionist-90874 Any idea why the /var/lib/rancher/rke2/server/tls/kube-controller-manager and kube-scheduler folders do not exist on a non-downstream-provisioned RKE2 cluster? I can see them on all of our downstream provisioned systems but not on our MCM cluster.
c

creamy-pencil-82913

04/15/2023, 12:09 AM
They're only present on provisioned clusters, for health checking by the rancher agent. The rancher agent configures them, they're not present by default on rke2.