# general
c
Have you opened a support case or GitHub issue? The first being preferred if you have paid support...
t
We have no paid support; I can open a GitHub issue.
I thought it was better to ask the Slack community before raising an issue. Am I wrong?
Issue opened: https://github.com/rancher/rancher/issues/41125 @busy-flag-55906, @fancy-oil-5019, @adorable-midnight-46384, @numerous-soccer-99009, @glamorous-lighter-5580, it seems that you are facing the very same problem. Would you mind participating in the issue by adding your information? 🙂 Thanks.
j
@tall-translator-73410 did you look at the kubelet logs on the hosts? What pods are not running
crictl ps
and rancher-system-agent logs to see if there are any errors?
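For reference, a minimal sketch of where to look on an RKE2 node; the paths assume the default RKE2 data directory (/var/lib/rancher/rke2) and may differ on a customized install:
Copy code
# containers managed by RKE2's embedded containerd
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps

# kubelet log written by RKE2
tail -n 200 /var/lib/rancher/rke2/agent/logs/kubelet.log

# rancher-system-agent service log
journalctl -u rancher-system-agent --no-pager -n 200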
t
Here is the rancher-system-agent log filtered on errors:
Copy code
# journalctl -u rancher-system-agent -n 2000 |grep -i error
Apr 10 13:11:35 <nodename> rancher-system-agent[50346]: time="2023-04-10T13:11:35Z" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (369730905 vs 369730954)"
Apr 10 13:11:35 <nodename> rancher-system-agent[50346]: time="2023-04-10T13:11:35Z" level=error msg="error syncing 'fleet-default/custom-6ab162c666c9-machine-plan': handler secret-watch: secret received was too old, requeuing"
with crictl, I can tell that both
kube-controller-manager-<node>
and
kube-scheduler-<node>
are Running. In the kubelet log there are several errors talking about
failed to sync secret cache: timed out waiting for the condition
or like this one:
Copy code
E0410 22:34:51.109948    1396 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 10s restarting failed container=kube-controller-manager pod=kube-controller-manager-<node-name>_kube-system(57585a0305e4e46df816ebab263926f3)\"" pod="kube-system/kube-controller-manager-<node-name>" podUID=57585a0305e4e46df816ebab263926f3
j
Go check the kube-controller-manager logs next. And since rancher-system-agent is saying the secret is too old, restart rancher-system-agent (a quick sketch of both steps is below).
Also add these logs to the GitHub issue.
It sounds like you may have a config.yaml error that you are passing down from Rancher. Did you have a feature flag on or kube-controller args that are now deprecated?
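A quick sketch of those two steps (assuming crictl is on the PATH and pointed at RKE2's containerd, as in the earlier sketch):
Copy code
# tail the kube-controller-manager container logs
crictl logs $(crictl ps -q --name kube-controller-manager) 2>&1 | tail -n 100

# restart the agent so it re-syncs its machine plan from Rancher
systemctl restart rancher-system-agent
journalctl -u rancher-system-agent -f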
t
There are no errors in the kube-controller-manager container logs
I double-checked the cluster config, but I'm quite sure we used the default parameters (we just removed the ingress controller from the main screen)
Note that these are all clusters created hundreds of days ago
j
you might want to grep kubelet.log for "error"
but this really does look like a rancher-system-agent issue
t
Yes, it is empty; the only error is about a cert-manager solver that is okay to fail:
Copy code
# crictl logs 74fa73b170537 2>&1 |grep -i error
I0411 06:36:59.217837       1 event.go:294] "Event occurred" object="<namespace>/cm-acme-http-solver-v2cq5" fieldPath="" kind="Endpoints" apiVersion="v1" type="Warning" reason="FailedToUpdateEndpoint" message="Failed to update endpoint <namespace>/cm-acme-http-solver-v2cq5: Operation cannot be fulfilled on endpoints \"cm-acme-http-solver-v2cq5\": StorageError: invalid object, Code: 4, Key: /registry/services/endpoints/<namespace>/cm-acme-http-solver-v2cq5, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 7eeee381-5e72-4a7e-a24e-1089b7d40156, UID in object meta: "
j
check rke2-server's logs
t
I think it's just a wrong probe config problem. Both the controller and the scheduler are behaving fine in the cluster. Do you know where rancher-system-agent takes its config from?
j
That's not the best ticket haha
t
There are many more errors in the rke2-server logs
j
You can check rancher logs next
rancher-system-agent is just the client side of the downstream provisioner in Rancher, so it's likely translating the machine config and instructions from the API and calling it a secret, but I haven't nailed it down yet
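On the earlier question of where rancher-system-agent gets its configuration: on a downstream node it runs as a systemd service, and the locations below are the defaults used by the system-agent install script (adjust if your install was customized):
Copy code
# agent configuration (work directory, path to the connection-info file)
cat /etc/rancher/agent/config.yaml

# the systemd unit and any environment file it loads
systemctl cat rancher-system-agent

# recent agent activity
journalctl -u rancher-system-agent -n 100 --no-pager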
t
Here are the rke2-server log errors that occur more than once:
Copy code
# journalctl -u rke2-server -n 4000 |grep -i error |cut -c 27- |sed -e 's/2023-[^Z]*Z/TIMEREDACTED"/' |sort |uniq -c|sort -n |grep -v "^      1"
      3 rke2[877]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
      4 rke2[366270]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
      4 rke2[367289]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
      4 rke2[877]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
      5 rke2[877]: time="TIMEREDACTED"" level=warning msg="Proxy error: write failed: io: read/write on closed pipe"
      9 rke2[387056]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
     11 rke2[387056]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
     15 rke2[912]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
    162 rke2[912]: time="TIMEREDACTED"" level=warning msg="Proxy error: write failed: io: read/write on closed pipe"
I saw that rancher-system-agent was referring to a secret in the Rancher cluster: namespace fleet-default, secret custom-<nodeid>-machine-plan.
But this secret has several keys, and I don't really know which one to look at to confirm the secret's "outdated" state (the keys are listed below, with a decoding sketch after them):
Copy code
-> applied-checksum
-> appliedPlan
-> failed-checksum
-> failed-output
-> failure-count
-> last-apply-time
-> plan
-> probe-statuses
-> success-count
-> applied-output
-> applied-periodic-output
-> failure-threshold
-> max-failures
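A minimal sketch for listing and decoding those keys from the Rancher (local) cluster, assuming kubectl access and jq, with <nodeid> replaced by the real machine name:
Copy code
SECRET=custom-<nodeid>-machine-plan

# list the keys stored in the machine-plan secret
kubectl -n fleet-default get secret "$SECRET" -o json | jq -r '.data | keys[]'

# decode a single key, e.g. probe-statuses
kubectl -n fleet-default get secret "$SECRET" -o json | jq -r '.data["probe-statuses"]' | base64 -d | jq .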
j
So in that secret there is a probe-statuses key at the bottom; I imagine this is where it shows kube-scheduler/kube-controller-manager as not ready?
t
yes !
Copy code
{
  "calico": {
    "healthy": true,
    "successCount": 1
  },
  "etcd": {
    "healthy": true,
    "successCount": 1
  },
  "kube-apiserver": {
    "healthy": true,
    "successCount": 1
  },
  "kube-controller-manager": {
    "failureCount": 2
  },
  "kube-scheduler": {
    "failureCount": 2
  },
  "kubelet": {
    "healthy": true,
    "successCount": 1
  }
}
This is the content of the secret key probe-statuses
j
so in that plan JSON you can see the current cloud-init and config.yaml files being applied
the config.yaml file for downstream provisioned clusters will be the one with the path "/etc/rancher/rke2/config.yaml.d/50-rancher.yaml"
you can also see the last applied plan; maybe compare the two keys, plan and appliedPlan, if they aren't already the same (a sketch for that is below)
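A minimal sketch for that comparison, under the same assumptions as before (kubectl access to the Rancher local cluster, jq installed, <nodeid> replaced):
Copy code
S=custom-<nodeid>-machine-plan
diff \
  <(kubectl -n fleet-default get secret "$S" -o json | jq -r '.data["plan"]'        | base64 -d | jq .) \
  <(kubectl -n fleet-default get secret "$S" -o json | jq -r '.data["appliedPlan"]' | base64 -d | jq .) \
  && echo "plan and appliedPlan are identical"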
t
I'm going to check that (just fixing a prod issue, not related to this 🙂 )
👍 1
The plan key and the appliedPlan key contain strictly the same data
Very interesting: the plan key contains a JSON document with two main keys, files and probes. I'll try to manually check the kube-controller-manager probe
Copy code
# k view-secret -n fleet-default custom-<nodeid>-machine-plan plan |jq '.probes."kube-controller-manager"'
{
  "initialDelaySeconds": 1,
  "timeoutSeconds": 5,
  "successThreshold": 1,
  "failureThreshold": 2,
  "httpGet": {
    "url": "<https://127.0.0.1:10257/healthz>",
    "caCert": "/var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt"
  }
}
By calling the URL with curl and ignoring the cert, kube-controller-manager answers "ok":
Copy code
# curl -k https://127.0.0.1:10257/healthz
ok
And when using the expected CA cert:
Copy code
# curl --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt https://127.0.0.1:10257/healthz
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
So it seems to be a certificate problem
j
Usually that shows up in the logs we checked before. Hrm, and you said only 100 days; I think the default is 365 days on certs
openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
that should get you the dates of the cert
t
I'll check it right now (I was cross-posting results to the issue)
It has been expired for a looooooooong time (4 months) 😄
Copy code
# openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
        Validity
            Not Before: Dec 29 13:14:28 2021 GMT
            Not After : Dec 29 13:14:28 2022 GMT
💯 1
So maybe try to renew the certificates from the Rancher UI?
j
your guess is as good as mine 😄
not sure if it will do it while the cluster is in provisioning like this
@creamy-pencil-82913 Can you advise?
t
(First of all, thanks a lot for your help. Even if we haven't found the real root cause yet, I don't want to wait for the resolution to thank you 🙂)
j
You're welcome. Hopefully Brad can give you a quick answer. You might also throw the question of how to fix the certs into #general to see if anyone else has experience
It's always bothered me that RKE2 doesn't self-manage the certs
t
Yes, as Fleet seems to be stuck, I'm not sure that clicking on "Rotate certificates" will actually do the job... but maybe; let's wait for the Rancher team's feedback
We already had cert issues in the past on RKE1, but I was quite sure that RKE2 was rotating certs at each upgrade. Maybe I was wrong...
j
If you find out 😄 let me know so I can put in a maintenance plan for our clusters hahaha
t
haha 😄
j
I just checked ours and they have already rolled once on our oldest cluster. Maybe RKE2 does it during upgrades? Not sure when you last upgraded
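For that kind of maintenance planning, a minimal sketch to print the expiry date of every cert under the RKE2 server TLS directory (default data dir assumed; glob depth may vary by install):
Copy code
for c in /var/lib/rancher/rke2/server/tls/*.crt /var/lib/rancher/rke2/server/tls/*/*.crt; do
  [ -f "$c" ] || continue
  printf '%s: ' "$c"
  openssl x509 -noout -enddate -in "$c"
done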
t
maybe only during major upgrades?
j
1.22 => 1.23 (Minor) maybe
t
yeah... my bad... what I call "major" in k8s is in fact minor 😄
j
yep, k8s (1) will never change hahaha
saw a meme about that the other day
t
I pretty much consider "1.24" to be the major number 😄 and ".11" the minor ^^
j
.11 is the patch
t
yeah ^^
j
t
haha! excellent 😄
So it seems that the fix for the "Waiting for probes: kube-controller-manager, kube-scheduler" issue is to force a certificate rotation for these services, but as the upgrade is stuck, clicking on "Rotate certificates" in the Rancher UI does nothing. Does anyone know how to trigger cert rotation in that case?
I'll try to cross-post to #general; this thread is so long that no one will ever read all of it.
🙌 1
c
This sounds like an issue with the plan controller on the Rancher side that causes it to constantly thrash the plan under certain conditions, which in turn causes rancher-system-agent to constantly restart rke2. This should be fixed in 2.7.2, which will be released quite soon.
t
That sounds great! I saw that rc10 just came out earlier; do you have any hint about the release date?
c
soon is the most I can say.
t
Ok, thanks!
c
it’s out
t
Hi @creamy-pencil-82913! thanks 🙂
I just upgraded one of our Rancher instances to 2.7.2. Is there anything to do to fix the cluster state, or should I just wait until it eventually auto-fixes the problem?
a
I upgraded to 2.7.2 today and thought that had caused the same issue... it just so happens that the cluster I'm seeing this issue on turned 365 days old today! 😛
Copy code
sudo openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
        Validity
            Not Before: Apr 12 09:18:14 2022 GMT
            Not After : Apr 12 09:18:14 2023 GMT
t
Unlucky! But it does confirm that the problem is due to cert expiration
it seems that 2.7.2 does not fix the "Waiting for probes: kube-controller-manager, kube-scheduler" problem
b
Sad news, I was hoping for positive comments here. Indeed, my SSL certs have expired as well
t
Your feedback is important @busy-flag-55906 🙂 It confirms that our problem is due to this cert expiration!
If we find out how to fix it, a lot of people will benefit from it 🙂
I tried to force certificate rotation from the Rancher UI, but it did nothing; I suppose because the cluster is waiting to apply the upgrade plan first
b
yes, I tried this as well with no luck
a
I've just tried this too, but still no good 😞
c
Those certs are created by the controllers themselves; they're not managed by either Rancher or RKE2. I'll have to see how they can be renewed.
t
Thanks @creamy-pencil-82913! It would be awesome to unlock this situation 🙂
c
yeah, you can’t delete the shadow pods. You would need to delete the scheduler and controller-manager manifests from /var/lib/rancher/rke2/agent/pod-manifests/ and then restart rke2-server
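A sketch of that approach, assuming the static pod manifests use the default RKE2 file names (list the directory first to confirm):
Copy code
ls /var/lib/rancher/rke2/agent/pod-manifests/
rm /var/lib/rancher/rke2/agent/pod-manifests/kube-controller-manager.yaml
rm /var/lib/rancher/rke2/agent/pod-manifests/kube-scheduler.yaml
systemctl restart rke2-server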
w
I mean you could probably even pkill kube-controller-manager and pkill kube-scheduler maybe?
c
or use crictl to delete them
yeah any of those should work
w
I'm going to file an issue in rancher/rancher around this so we can fix this properly... this is no bueno
💯 1
👍 1
t
Deleting the certs manually + forcing a container restart with crictl is working!
Here is a shell command to check whether the probes are okay or not:
Copy code
(
curl --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt \
  https://127.0.0.1:10257/healthz >/dev/null 2>&1 \
  && echo "[OK] Kube Controller probe" \
  || echo "[FAIL] Kube Controller probe";

curl --cacert /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
  https://127.0.0.1:10259/healthz >/dev/null 2>&1 \
  && echo "[OK] Scheduler probe" \
  || echo "[FAIL] Scheduler probe";
)
And below are the commands I used to force certificate rotation for the failed probes:
Copy code
echo "Rotating kube-controller-manager certificate"
rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
crictl rm -f $(crictl ps -q --name kube-controller-manager)

echo "Rotating kube-scheduler certificate"
rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}
crictl rm -f $(crictl ps -q --name kube-scheduler)
Thanks a lot @wide-receptionist-90874 and @creamy-pencil-82913! (FYI, I already opened an issue about this problem: https://github.com/rancher/rancher/issues/41125)
c
You might consider changing the title of your issue now that the problem is better understood. An alternative might be: "Improve Rancher management of RKE2 clusters to handle expired kube-controller-manager and kube-scheduler certs".
t
If you had the problem, would you be more likely to find this issue with the previous title or with the one you suggest?
To me, your title may be a good one for a new "feature request" issue, rather than one for sharing the workaround and unblocking impacted clusters, but maybe I'm wrong
j
Great work @tall-translator-73410, glad it all got figured out!
🙇‍♂️ 2
t
Glad you helped 😁 👍
a
Thanks everyone! 😄 This sorted out the problem quickly with a couple of our clusters! 🥳
👍 1
🦜 1
b
thanks, I was able to fix the issue on the cluster as well
🦜 1
👍 1
j
I noticed that a standard RKE2 cluster (not downstream provisioned) appears to have different certificate locations than the downstream clusters. Anyone know if the fix steps work on that as well?
w
@tall-translator-73410 I find it helps to add as many relevant logs as possible in the body of the issue, as it gets indexed by Google and people tend to find issues that way. Thank you for filing that issue! I'll get it labeled/assigned/scheduled to fix. Currently I'm actually thinking we might try to implement a fix for this in a manner that allows an RKE2/K3s upgrade to fix it.
💯 3
🦜 1
b
@tall-translator-73410 The symptom in the current issue title can have many root causes, including some that were indeed already fixed in 2.7.2. I just wanted to make sure the issue gets visibility with Rancher folks like @wide-receptionist-90874 and @creamy-pencil-82913, which already seems to be the case. Thanks for your troubleshooting efforts and for sharing them with the community!
w
@jolly-processor-88759 yes, the action of deleting the certificate files + kicking the component should cause a new cert to be generated.
j
@wide-receptionist-90874 Any idea why the /var/lib/rancher/rke2/server/tls/kube-controller-manager and kube-scheduler folders do not exist on a non-downstream-provisioned RKE2 cluster? I can see them on all of our downstream provisioned systems but not on our MCM cluster.
c
They're only present on provisioned clusters, for health checking by the Rancher agent. The Rancher agent configures them; they're not present by default on RKE2.