t

tall-translator-73410

04/11/2023, 8:37 AM
Hi everyone! We are facing the very same problem as @busy-flag-55906 on all our clusters: upgrades are frozen with one or multiple controlplane node(s) stuck in "Waiting for probes: kube-controller-manager, kube-scheduler". Here are the versions of the clusters having the problem:
• Rancher 2.7.1, RKE2 managed cluster upgraded from v1.24.7+rke2r1 to v1.24.9+rke2r2 (1/3 controlplanes affected, 1/3 waiting for plan)
• Rancher 2.6.9, RKE2 managed cluster v1.24.7+rke2r1 (3/3 controlplanes affected)
• Rancher 2.7.1, RKE2 managed cluster v1.24.8+rke2r1 (1/3 controlplanes affected)
We also have one cluster not having the problem:
• Rancher 2.6.9, RKE2 managed cluster v1.24.7+rke2r1
These clusters are hosted with different hosting providers; they are all based on Ubuntu 20.04 or 22.04. We already tried to:
• restart rancher-system-agent => no effect
• restart the nodes => no effect
• upgrade to a more recent version of RKE2 even if the previous one was not fully deployed => no effect (still one node up-to-date with probes down)
• upgrade to a more recent version of Rancher (2.6.9 => 2.7.1) => no effect
The clusters are healthy from the k8s point of view: the etcd cluster is healthy with all members in sync, and scheduling and controllers are working correctly. We are not sure if it's the root cause, but we found some articles about a change from the insecure to the secure port for the controller and scheduler in recent versions of k8s; could this be a problem of the wrong port being used in the check? Does anyone know if it's really rancher-system-agent that is in charge of probing the scheduler and controller? How can we check the probes' config?
Note: this only affects RKE2 clusters managed by Rancher; for clusters deployed manually with RKE2 and then imported into Rancher, upgrades work without any issue.
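For reference, the rancher-system-agent restart mentioned above is just the usual systemd service restart; a minimal sketch, assuming the default unit name (the same one used with journalctl later in this thread):

# restart and inspect the Rancher system agent on an affected node
systemctl restart rancher-system-agent
systemctl status rancher-system-agent --no-pager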
c

creamy-pencil-82913

04/11/2023, 9:29 AM
Have you opened a support case or GitHub issue? The first being preferred if you have paid support...
t

tall-translator-73410

04/11/2023, 9:33 AM
We have no paid support; I can open a GitHub issue.
I thought it was better to ask the Slack community before raising an issue, am I wrong?
Issue opened: https://github.com/rancher/rancher/issues/41125 @busy-flag-55906, @fancy-oil-5019, @adorable-midnight-46384, @numerous-soccer-99009, @glamorous-lighter-5580, it seems that you are facing the very same problem; would you mind participating in this issue by adding your information? 🙂 Thanks.
j

jolly-processor-88759

04/11/2023, 1:03 PM
@tall-translator-73410 did you look at the kubelet logs on the hosts? What pods are not running (crictl ps)? And check the rancher-system-agent logs to see if there are any errors?
t

tall-translator-73410

04/11/2023, 1:35 PM
Here is the rancher-system-agent log filtered on errors:
# journalctl -u rancher-system-agent -n 2000 |grep -i error
Apr 10 13:11:35 <nodename> rancher-system-agent[50346]: time="2023-04-10T13:11:35Z" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (369730905 vs 369730954)"
Apr 10 13:11:35 <nodename> rancher-system-agent[50346]: time="2023-04-10T13:11:35Z" level=error msg="error syncing 'fleet-default/custom-6ab162c666c9-machine-plan': handler secret-watch: secret received was too old, requeuing"
With crictl, I can tell that both kube-controller-manager-<node> and kube-scheduler-<node> are Running. In the kubelet log there are several errors talking about "failed to sync secret cache: timed out waiting for the condition", or like this one:
E0410 22:34:51.109948    1396 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 10s restarting failed container=kube-controller-manager pod=kube-controller-manager-<node-name>_kube-system(57585a0305e4e46df816ebab263926f3)\"" pod="kube-system/kube-controller-manager-<node-name>" podUID=57585a0305e4e46df816ebab263926f3
j

jolly-processor-88759

04/11/2023, 1:36 PM
Go check the kube-controller-manager logs next. Also, since rancher-system-agent is saying the secret is too old, restart rancher-system-agent.
Also add these logs to the GitHub issue.
It sounds like you may have a config.yaml error that you are passing down from Rancher. Did you have a feature flag on or kube-controller args that are now deprecated?
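For that kind of check, the Rancher-rendered RKE2 config on a downstream node lives in the drop-in file mentioned later in this thread; a minimal sketch, assuming the default paths:

# main RKE2 config plus the Rancher-managed drop-in (the base config.yaml may not exist on Rancher-provisioned nodes)
cat /etc/rancher/rke2/config.yaml 2>/dev/null
cat /etc/rancher/rke2/config.yaml.d/50-rancher.yaml
# look for controller-manager/scheduler extra args that might be deprecated
grep -iE 'kube-controller-manager|kube-scheduler' /etc/rancher/rke2/config.yaml.d/*.yaml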
t

tall-translator-73410

04/11/2023, 1:41 PM
There is no error in the kube-controller-manager container logs.
I double-checked the cluster config, but I'm quite sure that we used default parameters (we just removed the ingress controller from the main screen).
Note that these are all clusters created hundreds of days ago.
j

jolly-processor-88759

04/11/2023, 1:42 PM
you might want to grep kubelet.log for "error"
but this really does look like a rancher-system-agent issue
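On RKE2 the kubelet writes to a flat log file; a quick sketch, assuming the default RKE2 log location:

# grep the RKE2 kubelet log for recent errors
grep -i error /var/lib/rancher/rke2/agent/logs/kubelet.log | tail -n 50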
t

tall-translator-73410

04/11/2023, 1:43 PM
Yes, it is empty; the only error is about a cert-manager solver that is okay to fail:
# crictl logs 74fa73b170537 2>&1 |grep -i error
I0411 06:36:59.217837       1 event.go:294] "Event occurred" object="<namespace>/cm-acme-http-solver-v2cq5" fieldPath="" kind="Endpoints" apiVersion="v1" type="Warning" reason="FailedToUpdateEndpoint" message="Failed to update endpoint <namespace>/cm-acme-http-solver-v2cq5: Operation cannot be fulfilled on endpoints \"cm-acme-http-solver-v2cq5\": StorageError: invalid object, Code: 4, Key: /registry/services/endpoints/<namespace>/cm-acme-http-solver-v2cq5, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 7eeee381-5e72-4a7e-a24e-1089b7d40156, UID in object meta: "
j

jolly-processor-88759

04/11/2023, 1:45 PM
check rke2-server's logs
t

tall-translator-73410

04/11/2023, 1:45 PM
I think it's just a wrong probe config problem. Both the controller and scheduler are behaving correctly in the cluster. Do you know where rancher-system-agent takes its config from?
j

jolly-processor-88759

04/11/2023, 1:45 PM
That's not the best ticket haha
t

tall-translator-73410

04/11/2023, 1:48 PM
There are many more errors in the rke2-server logs
j

jolly-processor-88759

04/11/2023, 1:52 PM
You can check the Rancher logs next.
rancher-system-agent is just the client side of the downstream provisioner in Rancher, so it's likely translating the machine config and instructions from the API and calling it a secret, but I haven't nailed it down yet.
t

tall-translator-73410

04/11/2023, 1:53 PM
Here are the rke2-server log errors that occur more than once:
# journalctl -u rke2-server -n 4000 |grep -i error |cut -c 27- |sed -e 's/2023-[^Z]*Z/TIMEREDACTED"/' |sort |uniq -c|sort -n |grep -v "^      1"
      3 rke2[877]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
      4 rke2[366270]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
      4 rke2[367289]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
      4 rke2[877]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
      5 rke2[877]: time="TIMEREDACTED"" level=warning msg="Proxy error: write failed: io: read/write on closed pipe"
      9 rke2[387056]: time="TIMEREDACTED"" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
     11 rke2[387056]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
     15 rke2[912]: time="TIMEREDACTED"" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
    162 rke2[912]: time="TIMEREDACTED"" level=warning msg="Proxy error: write failed: io: read/write on closed pipe"
I saw that rancher-system-agent was referring to a secret in the Rancher cluster: namespace fleet-default, secret custom-<nodeid>-machine-plan. But this secret has several keys, and I don't really know which one to look at to confirm the secret's "outdated" state (a way to decode one of them is sketched after the list):
-> applied-checksum
-> appliedPlan
-> failed-checksum
-> failed-output
-> failure-count
-> last-apply-time
-> plan
-> probe-statuses
-> success-count
-> applied-output
-> applied-periodic-output
-> failure-threshold
-> max-failures
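For example, with kubectl access to the Rancher management cluster, those keys can be listed and decoded like this (the secret name placeholder is from above; jq is assumed to be available):

# list the keys present in the machine-plan secret
kubectl -n fleet-default get secret custom-<nodeid>-machine-plan -o json | jq -r '.data | keys[]'
# decode the probe-statuses key (the same pattern works for plan / appliedPlan)
kubectl -n fleet-default get secret custom-<nodeid>-machine-plan -o json | jq -r '.data."probe-statuses"' | base64 -d | jq .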
j

jolly-processor-88759

04/11/2023, 2:04 PM
So in that secret there is a probe-statuses key at the bottom; I imagine this is where it shows kube-scheduler/kube-controller-manager being not ready?
t

tall-translator-73410

04/11/2023, 2:04 PM
yes !
{
  "calico": {
    "healthy": true,
    "successCount": 1
  },
  "etcd": {
    "healthy": true,
    "successCount": 1
  },
  "kube-apiserver": {
    "healthy": true,
    "successCount": 1
  },
  "kube-controller-manager": {
    "failureCount": 2
  },
  "kube-scheduler": {
    "failureCount": 2
  },
  "kubelet": {
    "healthy": true,
    "successCount": 1
  }
}
This is the content of the secret key probe-statuses.
j

jolly-processor-88759

04/11/2023, 2:05 PM
So in that plan JSON you can see the current cloud-init and config.yaml files being applied.
The config.yaml file for downstream provisioned clusters will be the one with the path "/etc/rancher/rke2/config.yaml.d/50-rancher.yaml".
You can also see the last applied plan and compare the two, maybe, if they aren't already updated to the same: plan and appliedPlan.
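A quick way to compare those two keys (again assuming kubectl access to the Rancher management cluster and jq; the secret name placeholder is from above):

# decode both keys and diff their normalized JSON
kubectl -n fleet-default get secret custom-<nodeid>-machine-plan -o json | jq -r '.data.plan' | base64 -d | jq -S . > /tmp/plan.json
kubectl -n fleet-default get secret custom-<nodeid>-machine-plan -o json | jq -r '.data.appliedPlan' | base64 -d | jq -S . > /tmp/appliedPlan.json
diff /tmp/plan.json /tmp/appliedPlan.json && echo "plan and appliedPlan are identical"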
t

tall-translator-73410

04/11/2023, 2:11 PM
I'm going to check that (just fixing a prod issue, not related to this 🙂 )
👍 1
The plan key and appliedPlan contain strictly the same data.
Very interesting: the key "plan" contains a JSON with two main keys, files and probes. I'll try to manually check the kube-controller-manager probe:
# k view-secret -n fleet-default custom-<nodeid>-machine-plan plan |jq '.probes."kube-controller-manager"'
{
  "initialDelaySeconds": 1,
  "timeoutSeconds": 5,
  "successThreshold": 1,
  "failureThreshold": 2,
  "httpGet": {
    "url": "<https://127.0.0.1:10257/healthz>",
    "caCert": "/var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt"
  }
}
By calling the URL with curl and ignoring the cert, kube-controller-manager answers "ok":
# curl -k <https://127.0.0.1:10257/healthz>
ok
And when using the expected CA cert:
# curl --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt <https://127.0.0.1:10257/healthz>
curl: (60) SSL certificate problem: certificate has expired
More details here: <https://curl.haxx.se/docs/sslcerts.html>

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
So it seems to be a certificate problem
j

jolly-processor-88759

04/11/2023, 2:30 PM
Usually that shows up in the logs we checked before. Hrm, and you said only 100 days? I think the default is 365 days on the certs.
openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
that should get you the dates of the cert
t

tall-translator-73410

04/11/2023, 2:33 PM
I'm checking it right now (I was cross-posting results in the issue)
It has been expired for a looooooooong time (4 months) 😄
# openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
        Validity
            Not Before: Dec 29 13:14:28 2021 GMT
            Not After : Dec 29 13:14:28 2022 GMT
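To check all the control-plane certs at once, a hypothetical sweep over the RKE2 tls directory (paths assumed to match the ones above) could look like:

# print the expiry date of every cert under the RKE2 server tls directory
for crt in $(find /var/lib/rancher/rke2/server/tls -name '*.crt'); do
  echo "== $crt"
  openssl x509 -noout -enddate -in "$crt"
done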
💯 1
So maybe try to renew the certificates from the Rancher UI?
j

jolly-processor-88759

04/11/2023, 2:35 PM
your guess is as good as mine 😄
not sure if it will do it while the cluster is in provisioning like this
@creamy-pencil-82913 Can you advise?
t

tall-translator-73410

04/11/2023, 2:36 PM
(First of all, thanks a lot for your help; even if this isn't the real root cause, I don't want to wait for the resolution to thank you 🙂)
j

jolly-processor-88759

04/11/2023, 2:37 PM
You're welcome. Hopefully Brad can give you a quick answer. You might throw that question into #general, on how to fix the certs, to see if anyone else has experience.
It's always bothered me that RKE2 doesn't self-manage the certs.
t

tall-translator-73410

04/11/2023, 2:39 PM
Yes, as fleet seems to be stuck, I'm not sure that clicking on "rotate certificates" will actually do the job... but maybe; let's wait for the Rancher team's feedback.
We already had cert issues in the past on RKE1, but I was quite sure that RKE2 was rotating certs at each upgrade; maybe I was wrong...
j

jolly-processor-88759

04/11/2023, 2:40 PM
If you find out 😄 let me know so I can put in a maintenance plan for our clusters hahaha
t

tall-translator-73410

04/11/2023, 2:40 PM
haha 😄
j

jolly-processor-88759

04/11/2023, 2:41 PM
I just checked ours and they have already rolled once on our oldest cluster. Maybe it does it during RKE2 upgrades? Not sure when you last upgraded.
t

tall-translator-73410

04/11/2023, 2:42 PM
maybe only during major upgrades ?
j

jolly-processor-88759

04/11/2023, 2:42 PM
1.22 => 1.23 (Minor) maybe
t

tall-translator-73410

04/11/2023, 2:42 PM
yeah ... my bad ... when I say "major" in k8s it's minor in fact 😄
j

jolly-processor-88759

04/11/2023, 2:43 PM
yep, k8s (1) will never change hahaha
saw a meme about that the other day
t

tall-translator-73410

04/11/2023, 2:43 PM
I quite consider that "1.24" is the major number 😄 and ".11" the minor ^^
j

jolly-processor-88759

04/11/2023, 2:43 PM
.11 is the patch
t

tall-translator-73410

04/11/2023, 2:43 PM
yeah ^^
j

jolly-processor-88759

04/11/2023, 2:44 PM
t

tall-translator-73410

04/11/2023, 2:44 PM
haha ! excellent 😄
So it seems that the fix for the "Waiting for probes: kube-controller-manager, kube-scheduler" issue is to force a certificate rotation of these services, but as the upgrade is stuck, clicking on "Rotate certificates" in the Rancher UI does nothing. Does anyone know how to trigger a cert rotation in that case?
I'll try to cross-post to #general; this thread is so long that no one will ever read all of this.
🙌 1
c

creamy-pencil-82913

04/11/2023, 4:43 PM
This sounds like an issue with the plan controller on the rancher side that causes it to constantly thrash the plan under certain conditions, which in turn causes rancher-system-agent to constantly restart rke2. This should be fixed in 2.7.2 which will release quite soon.
t

tall-translator-73410

04/11/2023, 4:45 PM
That sounds great! I saw that rc10 just came out earlier; do you have any hint about the release date?
c

creamy-pencil-82913

04/11/2023, 4:46 PM
soon is the most I can say.
t

tall-translator-73410

04/11/2023, 4:50 PM
Ok thanks !
c

creamy-pencil-82913

04/12/2023, 5:05 AM
it’s out
t

tall-translator-73410

04/12/2023, 8:27 AM
Hi @creamy-pencil-82913! thanks 🙂
I just upgraded one of our Rancher instances to 2.7.2. Is there anything to do to fix the cluster state, or should I just wait until it eventually auto-fixes the problem?
a

adventurous-magazine-13224

04/12/2023, 12:40 PM
I upgraded to 2.7.2 today, and thought that caused the same issue... just so happens that the cluster I'm seeing this issue on is 365 days old today! 😛
sudo openssl x509 -text -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt | grep -A 2 Validity
        Validity
            Not Before: Apr 12 09:18:14 2022 GMT
            Not After : Apr 12 09:18:14 2023 GMT
t

tall-translator-73410

04/12/2023, 12:42 PM
Unlucky! But it does confirm that the problem is due to cert expiration.
It seems that 2.7.2 does not fix the "Waiting for probes: kube-controller-manager, kube-scheduler" problem.
b

busy-flag-55906

04/12/2023, 12:44 PM
Sad news, I was hoping for positive comments here. Indeed my SSL certs have expired as well.
t

tall-translator-73410

04/12/2023, 12:45 PM
Your feedback is important @busy-flag-55906 🙂 It confirms that our problem is due to this cert expiration!
If we find out how to fix it, a lot of people will benefit from it 🙂
I tried to force a certificate rotation from the Rancher UI, but it did nothing, I suppose due to the fact that the cluster is waiting to apply the upgrade plan first.
b

busy-flag-55906

04/12/2023, 12:49 PM
yes, I tried this as well with no luck
a

adventurous-magazine-13224

04/12/2023, 12:51 PM
I've just tried this too, but still no good 😞
c

creamy-pencil-82913

04/12/2023, 3:49 PM
Those certs are created by the controllers themselves, they’re not managed by either rancher or rke2. I’ll have to see how they can be renewed.
t

tall-translator-73410

04/12/2023, 3:59 PM
Thanks @creamy-pencil-82913! It would be awesome to unlock this situation 🙂
c

creamy-pencil-82913

04/12/2023, 9:01 PM
yeah, you can’t delete the shadow pods. You would need to delete the scheduler and controller-manager manifests from /var/lib/rancher/rke2/agent/pod-manifests/ and then restart rke2-server
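A minimal sketch of that manifest-based option, assuming the default RKE2 static pod manifest file names (rke2-server recreates them on restart):

# remove the static pod manifests for the two affected components, then restart rke2-server
rm /var/lib/rancher/rke2/agent/pod-manifests/kube-controller-manager.yaml \
   /var/lib/rancher/rke2/agent/pod-manifests/kube-scheduler.yaml
systemctl restart rke2-server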
w

wide-receptionist-90874

04/12/2023, 9:01 PM
I mean you could probably even pkill kube-controller-manager and pkill kube-scheduler maybe?
c

creamy-pencil-82913

04/12/2023, 9:01 PM
or use crictl to delete them
yeah any of those should work
w

wide-receptionist-90874

04/12/2023, 9:02 PM
I'm going to file an issue in rancher/rancher around this so we can fix this properly... this is no bueno
💯 1
👍 1
t

tall-translator-73410

04/13/2023, 9:01 AM
Deleting the certs manually + forcing a container restart with crictl works!
Here is a shell snippet to check whether the probes are okay or not:
(
curl  --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt \
  <https://127.0.0.1:10257/healthz> >/dev/null 2>&1 \
  && echo "[OK] Kube Controller probe" \
  || echo "[FAIL] Kube Controller probe";

curl --cacert /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
  <https://127.0.0.1:10259/healthz> >/dev/null 2>&1  \
  && echo "[OK] Scheduler probe" \
  || echo "[FAIL] Scheduler probe";
)
And below are the commands I used to force a certificate rotation on the failed probes:
echo "Rotating kube-controller-manager certificate"
rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
crictl rm -f $(crictl ps -q --name kube-controller-manager)

echo "Rotating kube-scheduler certificate"
rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}
crictl rm -f $(crictl ps -q --name kube-scheduler)
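To confirm the rotation worked, re-checking the validity dates afterwards (same openssl pattern as earlier in the thread) should show fresh certs:

# both certs should now show a fresh Not Before / Not After range
openssl x509 -noout -dates -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt
openssl x509 -noout -dates -in /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt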
Thanks a lot @wide-receptionist-90874 and @creamy-pencil-82913! (FYI, I already opened an issue on this problem: https://github.com/rancher/rancher/issues/41125)
c

creamy-wolf-46823

04/13/2023, 9:42 AM
You might consider changing the title of your issue now that the problem is better understood. An alternative might be: "Improve Rancher management of RKE2 clusters to handle expired kube-controller-manager and kube-scheduler certs".
t

tall-translator-73410

04/13/2023, 10:29 AM
If you had the problem, would you be more likely to find this issue with the previous title or with the one you suggest?
To me, your title may be a good one for a new "feature request" issue, not for giving a workaround and unblocking impacted clusters, but maybe I'm wrong.
j

jolly-processor-88759

04/13/2023, 1:12 PM
Great work @tall-translator-73410, glad it all got figured out!
🙇‍♂️ 2
t

tall-translator-73410

04/13/2023, 1:14 PM
Glad you helped 😁 👍
a

adventurous-magazine-13224

04/13/2023, 1:15 PM
Thanks everyone! 😄 This sorted out the problem quickly with a couple of our clusters! 🥳
👍 1
:partyparrot: 1
b

busy-flag-55906

04/13/2023, 1:21 PM
thanks, I was able to fix the issue on the cluster as well
:partyparrot: 1
👍 1
j

jolly-processor-88759

04/13/2023, 1:53 PM
I noticed that a standard RKE2 cluster (non downstream provisioned) appears to have different certificate locations than the downstream clusters. Anyone know if the fix steps work on that as well?
w

wide-receptionist-90874

04/13/2023, 4:24 PM
@tall-translator-73410 I find it helps to add as many relevant logs as possible in the body of the issue, as it gets indexed by Google and people tend to find issues that way. Thank you for filing that issue! I'll get it labeled/assigned/scheduled to fix. Currently I'm actually thinking we might try to implement a fix for this in a manner that allows an RKE2/K3s upgrade to fix it.
💯 3
:partyparrot: 1
b

best-microphone-20624

04/13/2023, 5:03 PM
@tall-translator-73410 The current title for the issue can cover many root causes, including some that were indeed already fixed in 2.7.2. Just wanted to make sure the issue gets visibility with Rancher folks like @wide-receptionist-90874 and @creamy-pencil-82913, which already seems to be the case. Thanks for your troubleshooting efforts and for sharing them with the community!
w

wide-receptionist-90874

04/13/2023, 7:43 PM
@jolly-processor-88759 yes, the action of deleting the certificate files + kicking the component should cause a new cert to be generated.
j

jolly-processor-88759

04/14/2023, 12:16 PM
@wide-receptionist-90874 Any idea why the /var/lib/rancher/rke2/server/tls/kube-controller-manager and kube-scheduler folders do not exist on a non-downstream-provisioned RKE2 cluster? I can see them on all of our downstream provisioned systems but not on our MCM cluster.
c

creamy-pencil-82913

04/15/2023, 12:09 AM
They're only present on provisioned clusters, for health checking by the rancher agent. The rancher agent configures them, they're not present by default on rke2.