# rke2
g
Copy code
curl -H "rke2-Node-Name: <hostname>" -H "rke2-Node-Password: <password>" <https://192.168.0.48:9345/v1-rke2/serving-kubelet.crt> -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key -vvv
*   Trying 192.168.0.48:9345...
* TCP_NODELAY set
* Connected to 192.168.0.48 (192.168.0.48) port 9345 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: O=rke2; CN=rke2
*  start date: Oct 19 11:35:18 2023 GMT
*  expire date: Oct 18 11:37:59 2024 GMT
*  issuer: CN=rke2-server-ca@1697715318
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55e9935ee300)
> GET /v1-rke2/serving-kubelet.crt HTTP/2
> Host: 192.168.0.48:9345
> user-agent: curl/7.68.0
> accept: */*
> rke2-node-name: <hostname>
> rke2-node-password: <password>
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
and it hangs, although the /readyz endpoint is ok
unrelated node, but on the server side I see: Nov 01 07:55:05 <controlplane> rke2[3525]: I1101 07:55:05.665245 3525 request.go:690] Waited for 10h11m15.660255046s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:6443/api/v1/namespaces/kube-system/secrets/<hostname>.node-password.rke2
10h throttle 😄
would be nice to know how to tune the client kube-api qps here
this is clearly a concurrency bug as after restarting the rke2-server it starts to reply immediately
@creamy-pencil-82913 wdyt?
I have 320 nodes in the cluster and tried a big-bang restart. This is v1.25.14+rke2r1 and Rancher 2.7.6.
Another occasion: this time, if I direct the request to the apiserver node directly, it works, but it hangs through the local load balancer:
Copy code
curl -H "rke2-Node-Name: <hostname>" -H "rke2-Node-Password: <password>" <https://apiserver:9345/v1-rke2/serving-kubelet.crt> -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key -vvv
*   Trying 192.168.0.48:9345...
* TCP_NODELAY set
* Connected to 192.168.0.48 (192.168.0.48) port 9345 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: O=rke2; CN=rke2
*  start date: Oct 19 11:35:18 2023 GMT
*  expire date: Oct 18 11:37:59 2024 GMT
*  issuer: CN=rke2-server-ca@1697715318
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x56372e206300)
> GET /v1-rke2/serving-kubelet.crt HTTP/2
> Host: 192.168.0.48:9345
> user-agent: curl/7.68.0
> accept: */*
> rke2-node-name: <hostname>
> rke2-node-password: <password>
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 200
< content-type: text/plain; charset=utf-8
< content-length: 1506
< date: Wed, 01 Nov 2023 10:46:24 GMT
<
-----BEGIN CERTIFICATE-----
...
Copy code
# curl -H "rke2-Node-Name: <hostname>" -H "rke2-Node-Password: <password>" https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key -vvv
*   Trying 127.0.0.1:6444...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 6444 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: O=rke2; CN=rke2
*  start date: Oct 19 11:35:18 2023 GMT
*  expire date: Oct 18 11:37:59 2024 GMT
*  issuer: CN=rke2-server-ca@1697715318
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55da0c6b7300)
> GET /v1-rke2/serving-kubelet.crt HTTP/2
> Host: 127.0.0.1:6444
> user-agent: curl/7.68.0
> accept: */*
> rke2-node-name: <hostname>
> rke2-node-password: <password>
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
and that's because the IP added to the rke2-agent-load-balancer indeed does not reply, while the one that does work was removed because it wasn't responding during the startup of the rke2-agent
Copy code
msg="Adding server to load balancer rke2-agent-load-balancer: <apiserver 1>:9345"
msg="Adding server to load balancer rke2-agent-load-balancer: <apiserver 3>:9345"
msg="Removing server from load balancer rke2-agent-load-balancer: <apiserver 1>:9345"
msg="Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [<apiserver 3>:9345] [default: <apiserver 1>:9345]"
msg="failed to get CA certs: Get \"<https://127.0.0.1:6444/cacerts>\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
and even after restarting the rke2-agent the same happens: it tries to talk to apiserver 3 and removes the working apiserver 1 from the agent LB
c
The load-balancer list is populated from the list of endpoints for the Kubernetes service in the default namespace:
kubectl get endpoints kubernetes
g
yeah, since I restarted all nodes at once - to test this scenario - I assume there was only one server node active, but new nodes were never added to the list as they came up
Also raised this https://github.com/rancher/rke2/issues/4975 to summarize my findings
c
You’d need to check that endpoint list to confirm that it’s doing the right thing. I suspect that some of the apiservers are getting overloaded with the rush of clients, which causes the issues you’re seeing - the excessive client-side throttling, and removal of servers from the endpoint list. You might check to see if some of the apiserver static pods are crashing and restarting?
the load-balancer switches servers whenever a backend cannot be dialed successfully; it doesn't know anything about the state of requests going through the connection. If it is able to connect, but the resulting requests hang, it will never switch backends.
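One way to see that distinction from a stuck worker (same endpoints and cert paths as the curls above; the first call only proves the backend is dialable, the second caps a real supervisor request so a hang shows up as a timeout):
Copy code
# dial-level check: if this connects at all, the load balancer keeps the backend
curl -k --connect-timeout 5 https://192.168.0.50:9345/v1-rke2/readyz \
  --cert /var/lib/rancher/rke2/agent/client-kubelet.crt \
  --key /var/lib/rancher/rke2/agent/client-kubelet.key
# request-level check: the dial can succeed while the request itself stalls, so cap it
curl -k --max-time 30 \
  -H "rke2-Node-Name: <hostname>" -H "rke2-Node-Password: <password>" \
  --cert /var/lib/rancher/rke2/agent/client-kubelet.crt \
  --key /var/lib/rancher/rke2/agent/client-kubelet.key \
  https://192.168.0.50:9345/v1-rke2/serving-kubelet.crt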
g
for the latter, I get it, although in the end it's misleading for the agent as it can't perform the action it wants and retries the same apiserver again and again without luck
let me do another round of restart and show the endpoint status
the apiserver pods btw are stable, I have in total 36 cores for the apiservers (+controller managers), so I would assume that should be enough 🙂
Now containerd didn't start on the etcd nodes.
Copy code
Nov 01 17:36:37 production-island10-overcloud-etcd-2 rke2[708]: time="2023-11-01T17:36:37Z" level=info msg="Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory\""
even after restarting the rke2-server on this etcd node, I don't see containerd coming up
etcd-1: no, etcd-2: no, etcd-3: yes, etcd-4: no, etcd-5: yes - so I don't have quorum
tried rebooting etcd-2, no luck
if only I knew why containerd can't come up...
I wonder if it'd be better to run my own containerd instead of the embedded one
cluster is bricked 😕
lol, it took a looong time, but at least I have quorum:
Copy code
sh-4.4# etcdctl --cacert="/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt" --cert="/var/lib/rancher/rke2/server/tls/etcd/server-client.crt" --key="/var/lib/rancher/rke2/server/tls/etcd/server-client.key" --endpoints="https://192.168.0.16:2379,https://192.168.0.17:2379,https://192.168.0.18:2379https://192.168.0.19:2379,https://192.168.0.20:2379" endpoint status --write-out table
{"level":"warn","ts":"2023-11-01T18:20:04.314357Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc0003c8380/192.168.0.16:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp: address 192.168.0.18:2379https:: too many colons in address\""}
Failed to get the status of endpoint <https://192.168.0.18:2379><https://192.168.0.19:2379> (context deadline exceeded)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.0.16:2379 | 90854c1b43df1af2 |   3.5.9 |   82 MB |     false |      false |         6 |   54149274 |           54149274 |        |
| https://192.168.0.17:2379 | 6336293dad683176 |   3.5.9 |   82 MB |     false |      false |         6 |   54149274 |           54149274 |        |
| https://192.168.0.20:2379 |  6df555c9f60573b |   3.5.9 |  394 MB |      true |      false |         6 |   54149345 |           54149345 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
so endpoints:
Copy code
kubectl get endpoints kubernetes
NAME         ENDPOINTS                                               AGE
kubernetes   192.168.0.48:6443,192.168.0.49:6443,192.168.0.50:6443   13d
c
you might check the containerd logs on that node if it ever fails to come up again - that usually indicates that containerd is stalled checking something. I’ve never seen it take more than a few seconds though.
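If it does happen again, a quick place to look (a sketch, assuming the default rke2 data dir):
Copy code
# the embedded containerd writes its own log file under the agent dir
tail -n 100 /var/lib/rancher/rke2/agent/containerd/containerd.log
# and the supervisor logs show why the cri socket isn't coming up
journalctl -u rke2-server -n 200 | grep -i containerd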
g
yeah me neither, it was strange
so, back to the endpoints issue: now controlplane-3 churns while the other two chill:
Copy code
Nov 01 18:30:23 production-island10-overcloud-controlplane-3 rke2[5989]: time="2023-11-01T18:30:23Z" level=info msg="certificate CN=production-island10-overcloud-worker-95 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 18:30:23 +0000 UTC"
Nov 01 18:30:24 production-island10-overcloud-controlplane-3 rke2[5989]: time="2023-11-01T18:30:24Z" level=info msg="certificate CN=production-island10-overcloud-worker-190 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 18:30:24 +0000 UTC"
Nov 01 18:30:24 production-island10-overcloud-controlplane-3 rke2[5989]: time="2023-11-01T18:30:24Z" level=info msg="certificate CN=production-island10-overcloud-worker-76 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 18:30:24 +0000 UTC"
this goes constantly for all 281 NotReady nodes I have now
logs from a NotReady node:
Copy code
Nov 01 17:30:14 production-island10-overcloud-worker-99 rke2[927]: time="2023-11-01T17:30:14Z" level=info msg="Adding server to load balancer rke2-agent-load-balancer: 192.168.0.48:9345"
Nov 01 17:30:14 production-island10-overcloud-worker-99 rke2[927]: time="2023-11-01T17:30:14Z" level=info msg="Adding server to load balancer rke2-agent-load-balancer: 192.168.0.50:9345"
Nov 01 17:30:14 production-island10-overcloud-worker-99 rke2[927]: time="2023-11-01T17:30:14Z" level=info msg="Removing server from load balancer rke2-agent-load-balancer: 192.168.0.48:9345"
Nov 01 17:30:14 production-island10-overcloud-worker-99 rke2[927]: time="2023-11-01T17:30:14Z" level=info msg="Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [192.168.0.50:9345] [default: 192.168.0.48:9345]"
Copy code
192.168.0.50 is production-island10-overcloud-controlplane-3
the k8s service endpoints are stable, no changes there, all three are added
I restart the rke2-agent on production-island10-overcloud-worker-99:
Copy code
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=info msg="Starting rke2 agent v1.25.14+rke2r1 (36d7417e024e8dad34ebbf94b210ab3dd0f52cd7)"
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=info msg="Adding server to load balancer rke2-agent-load-balancer: 192.168.0.48:9345"
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=info msg="Adding server to load balancer rke2-agent-load-balancer: 192.168.0.50:9345"
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=info msg="Removing server from load balancer rke2-agent-load-balancer: 192.168.0.48:9345"
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=info msg="Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [192.168.0.50:9345] [default: 192.168.0.48:9345]"
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Nov 01 18:38:00 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:38:00Z" level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: 192.168.0.48:6443"
Nov 01 18:38:00 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:38:00Z" level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: 192.168.0.50:6443"
Nov 01 18:38:00 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:38:00Z" level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: 192.168.0.48:6443"
Nov 01 18:38:00 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:38:00Z" level=info msg="Running load balancer rke2-api-server-agent-load-balancer 127.0.0.1:6443 -> [192.168.0.50:6443] [default: 192.168.0.48:6443]"
Nov 01 18:38:11 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:38:11Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"<https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt>\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
the same as I said - it still goes to the same controlplane node as before
Copy code
kube-apiserver-production-island10-overcloud-controlplane-1             1/1     Running       24 (39m ago)    8d
kube-apiserver-production-island10-overcloud-controlplane-2             1/1     Running       24 (40m ago)    8d
kube-apiserver-production-island10-overcloud-controlplane-3             1/1     Running       17 (39m ago)    8d
apiservers are up for ~40m
if now I restart rke2-server on production-island10-overcloud-controlplane-3, the other two apiservers will start to receive requests
of course, after the restart, the clients just started to bombard the next server:
Copy code
192.168.0.48 is production-island10-overcloud-controlplane-1
but production-island10-overcloud-worker-99 is still timing out nevertheless:
Copy code
Nov 01 18:43:33 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:43:33Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"<https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt>\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
but:
Copy code
root@production-island10-overcloud-worker-99:~# curl -H "rke2-Node-Name: production-island10-overcloud-worker-99" -H "rke2-Node-Password: <password>" https://127.0.0.1:6444/v1-rke2/readyz -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key
ok
Copy code
root@production-island10-overcloud-worker-99:~# cat /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json
{
  "ServerURL": "<https://192.168.0.48:9345>",
  "ServerAddresses": [
    "192.168.0.50:9345"
  ],
  "Listener": null
}
c
Why is 192.168.0.50 the only one that it has in the json cache file? 192.168.0.48 is the default (from the --server uri) but I am curious why it only has that one address in the list.
Were the other two unavailable at the time you shut this node down? If you look at the logs from the agent prior to the restart, do you see it removing the other two addresses from the list?
g
yes, as I said, it was a big-bang restart
and even after I restarted rke2-agent on worker-99 the list remained the same
c
well yeah it’s not going to be able to update the list until it can talk to the apiserver, and apparently the apiserver on that node that it wants to connect to is overloaded by incoming requests
g
Copy code
root@production-island10-overcloud-worker-99:~# curl -H "rke2-Node-Name: production-island10-overcloud-worker-99" -H "rke2-Node-Password: <password>" https://192.168.0.48:9345/v1-rke2/serving-kubelet.crt -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key
^C

root@production-island10-overcloud-worker-99:~# curl -H "rke2-Node-Name: production-island10-overcloud-worker-99" -H "rke2-Node-Password: <password>" https://192.168.0.49:9345/v1-rke2/serving-kubelet.crt -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key
-----BEGIN CERTIFICATE-----
...
root@production-island10-overcloud-worker-99:~# curl -H "rke2-Node-Name: production-island10-overcloud-worker-99" -H "rke2-Node-Password: <password>" https://192.168.0.50:9345/v1-rke2/serving-kubelet.crt -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key
-----BEGIN CERTIFICATE-----
...
so, now that clients have swarmed .48, it can't reply to these requests, but the other two would be happy to answer
the agent loadbalancer could WATCH the endpoints and update as new endpoints appear
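That list is visible (and watchable) with plain kubectl, e.g.:
Copy code
# stream changes to the kubernetes service endpoints - the list the agent LB is built from
kubectl -n default get endpoints kubernetes --watch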
c
that is exactly what it does
but it can’t do that until it can talk to the apiserver
and you’ve overloaded the apiserver with 300+ clients all starting up at once
Generally, when building a cluster, we recommend adding nodes in smaller batches; I would recommend doing the same when coming up from a cold start. If you try to bring them all up at once you're going to DDoS things.
g
https://192.168.0.48:9345/v1-rke2/readyz says "ok"
apiserver is not overloaded; it's the rke2-server that is eating the CPU
looks like it constantly regenerates the kubelet certs
c
All that readyz endpoint indicates is that the apiserver came up and rke2 was able to initialize its core controllers.
g
such as:
Copy code
Nov 01 18:55:23 production-island10-overcloud-controlplane-1 rke2[6447]: time="2023-11-01T18:55:23Z" level=info msg="certificate CN=production-island10-overcloud-worker-99 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 18:55:23 +0000 UTC"
Nov 01 18:56:14 production-island10-overcloud-controlplane-1 rke2[6447]: time="2023-11-01T18:56:14Z" level=info msg="certificate CN=production-island10-overcloud-worker-99 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 18:56:14 +0000 UTC"
apiserver eats 0.3 cores, while rke2-server eats 11 cores now
c
all that log shows is that the agent on that node requested a cert two times. the agent will request certs every time it starts; if it keeps requesting them over and over again it indicates that the agent is crashing or timing out and restarting
g
my main concern - the reason I'm testing this - is the non-zero chance that at some point the whole cluster goes down and is delayed coming back up due to inefficiencies here and there
c
right, and my suggestion was that, if that happens, you don’t try to bring up all the agents at the same time. Do them in smaller batches, say 25 or so.
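A rough way to do that from a jump host (a sketch, assuming a nodes.txt with one worker hostname per line and ssh access; batch size and pause are arbitrary):
Copy code
# restart agents 25 at a time, pausing between batches so the servers can drain the backlog
split -l 25 nodes.txt batch_
for b in batch_*; do
  while read -r node; do
    ssh "$node" 'systemctl restart rke2-agent' &
  done < "$b"
  wait
  sleep 120
done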
g
I believe the problem is that the agent constantly tries only one apiserver endpoint - despite all 3 being there and working
if it rotated the endpoints (and potentially backed off) it'd get through in no time
c
it tries only 1 because there was only 1 available when you shut it down. It would update its list to find the other endpoints, but it can’t because that one server (that it would get additional endpoints from) is overloaded.
g
let's see 🙂
Copy code
root@production-island10-overcloud-controlplane-1:~# systemctl restart rke2-server
once back up, this control-plane will not be swarmed anymore
c
right so the sequence is: get new certs from existing server -> connect to apiserver -> update endpoints
if it can connect to the existing server, but the cert request times out (because that server is overloaded) it will not move past that
if you stop the service on the existing server, then it will fail to connect, and try the other server (the default one, from the logs)
the underlying cause here is that your shutdown sequence left all the agents with only a single server in their cache, and you’ve started them all up at the same time, and overloaded it.
while I agree that we could potentially handle this better, the reality is that starting up hundreds of agents at the same time is never going to work super well. Even when doing our scale testing we are careful to create or restart agents in batches of 25-50.
if you’re using rancher or the system-upgrade-controller, it is done even more slowly, in single-digit batches
g
yeah, this is what happened, but this is a hell of an inefficient method unfortunately. Wouldn't it be possible to cache the endpoints for these use-cases? i.e. use the cached version of the client certs (they're already there!) and the cached version of /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json, and update once the connection comes back?
c
using the old certs 1. wouldn’t work if they’re expired 2. would require us to restart everything again after the certs have been updated, as the core components don’t support live reloading of certificates
g
at least round-robin between the ServerURL and whatever is in ServerAddresses if there's a non-nil err
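Roughly what that proposal would look like, done by hand from a stuck worker (server IPs and cert paths as in the curls above; this only illustrates the idea, it is not how the agent behaves today):
Copy code
# try every known server in turn with a capped timeout, back off briefly between attempts
for srv in 192.168.0.48 192.168.0.49 192.168.0.50; do
  curl -sk --max-time 15 \
    -H "rke2-Node-Name: <hostname>" -H "rke2-Node-Password: <password>" \
    --cert /var/lib/rancher/rke2/agent/client-kubelet.crt \
    --key /var/lib/rancher/rke2/agent/client-kubelet.key \
    "https://${srv}:9345/v1-rke2/serving-kubelet.crt" && break
  sleep 5
done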
c
that would be a lot more work to handle. it is a very simple l3 load balancer, as long as the endpoint can be connected to then it is used. as I said above, only if the dial fails does it remove the server from the list and start using a different one.
g
using the old certs, yeah, I get it, but which use-case is more likely and/or more problematic? 🙂 not being able to start the server at all, or starting the server with an expired cert?
is it possible for me to disable the agent LB and use something else?
c
no
is it not possible for you to avoid starting all the agents at the exact same time?
g
it always is, but I run 20k customer websites per island, and if the shit hits the fan all of them will start at the same time 🙂
I thought of adding a sleep on the nodes before starting up the rke2-agent
don't laugh 😄
Copy code
# /etc/systemd/system/rke2-agent.service.d/10-override.conf
[Service]
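# stagger only worker nodes by a random 0-59 seconds so a cold start doesn't hit the servers all at once;
# non-worker hostnames fall through to /bin/true and start immediately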
ExecStartPre=/bin/bash -xc '/bin/hostname | grep -q worker && /bin/sleep $((RANDOM % 60)) || /bin/true'
but meh it hurts my eyes 😕
c
yeah possibly a good workaround for now!
We can leave that issue open to track figuring out a way to handle this better - but at the moment just staggering the startups is probably the best way to address it
g
my idea is to check on the worker nodes whether the existing client certificate is expired; if it isn't, try to use it to update the endpoints; only if it's a must - i.e. the cert is expired or missing - update the client certificate first, then go to the apiserver and then update the endpoints
this would be much much cheaper computationally
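Checking whether the cached cert is still usable is cheap; for example (same path as the curls above):
Copy code
# print the expiry date of the cached kubelet client cert
openssl x509 -noout -enddate -in /var/lib/rancher/rke2/agent/client-kubelet.crt
# exit 0 while it is still valid, non-zero once it has expired
openssl x509 -noout -checkend 0 -in /var/lib/rancher/rke2/agent/client-kubelet.crt \
  && echo "still valid" || echo "expired"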
c
yeah I’m not sure where it’s getting overloaded; I wouldn’t expect generation of certs + keys to be that hard. If you can get a stack trace from the rke2 process when it’s chewing up 12 cores, I’d be interested to see what it’s doing.
g
I have a flamegraph
c
that’d work too
I was just going to suggest kill -ABRT and then grabbing the dump from the logs, but a flame graph is better
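For reference, roughly what that looks like (assuming a single rke2 process on the node; the Go runtime dumps all goroutine stacks to the journal on SIGABRT, and the process exits):
Copy code
# send SIGABRT to the supervisor and pull the goroutine dump out of the journal
kill -ABRT "$(pidof rke2)"
journalctl -u rke2-server -n 5000 > rke2-goroutines.txt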
do you have the full thing? not sure what the scale is here.
this bit here just shows that it’s verifying the agent token, and spending a bunch of time in the constant-time password comparison
g
yeah, and I'm afraid that the request.go client is throttled, as I wrote in my issue, and that's the reason why it can't finish
c
have you considered using bootstrap tokens instead of classic password-style tokens?
those are handled differently, I’m curious if you’d see any different behavior
g
hm, I'm using this:
Copy code
${rancher2_cluster_v2.default.cluster_registration_token.0.node_command}
in TF
c
yeah that’s the default
just using the token
g
so, a default bootstrap password and use that?
c
if you’re using the rancher tf provider for provisioning I don’t think you can change it, it always just uses the server token
g
kk, checking
btw, I restarted the rke2-agent on one NotReady node; the ServerURL is not doing anything - it still tried the swarmed server and didn't update the endpoints:
Copy code
Nov 01 19:34:27 production-island10-overcloud-worker-96 rke2[1186]: time="2023-11-01T19:34:27Z" level=info msg="Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [192.168.0.50:9345] [default: 192.168.0.48:9345]"
c
it won’t update it until it is able to talk to the apiserver. I can’t remember if it logs it later or not, you might need to check the contents of the cache file, or turn the log level up to debug, if you want to see that later.
g
yeah, I checked the cache and it's still the same
only tries to talk to the server in ServerAddresses and never tries ServerURL instead
c
as long as it can be connected to it won’t switch
you’d need to actually stop the rke2 service on the node it’s trying to connect to, to get it to fail over.
g
yeah, I did exactly that
c
are you looking at the apiserver or the supervisor lb?
g
this one for now:
Copy code
# cat /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json
{
  "ServerURL": "<https://192.168.0.48:9345>",
  "ServerAddresses": [
    "192.168.0.50:9345"
  ],
  "Listener": null
Copy code
# cat /var/lib/rancher/rke2/agent/etc/rke2-api-server-agent-load-balancer.json
{
  "ServerURL": "<https://192.168.0.48:6443>",
  "ServerAddresses": [
    "192.168.0.50:6443"
  ],
  "Listener": null
^^ for the apiserver LB
152 NotReady nodes left :D
c
for the apiserver, just stopping rke2 won't do it, you'd have to kill the apiserver static pod as well. rke2-killall.sh is probably the easiest way to do that.
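Something like (a sketch; the killall script path can vary by install method):
Copy code
# stop the supervisor, then tear down the static pods (including kube-apiserver)
systemctl stop rke2-server
/usr/local/bin/rke2-killall.sh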
g
I saw that with rke2-server restart they started to swarm the first apiserver tho'
c
for the supervisor yeah, but the apiserver lb is separate - at least with regards to forcing it to fail over
g
so there's movement: if I keep killing the "active" apiservers once they're overloaded, the agents move around and eventually all nodes will be Ready
thx for all the help, it's getting late here - really appreciate it!