# rke2
g
Copy code
curl -H "rke2-Node-Name: <hostname>" -H "rke2-Node-Password: <password>" <https://192.168.0.48:9345/v1-rke2/serving-kubelet.crt> -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key -vvv
*   Trying 192.168.0.48:9345...
* TCP_NODELAY set
* Connected to 192.168.0.48 (192.168.0.48) port 9345 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: O=rke2; CN=rke2
*  start date: Oct 19 11:35:18 2023 GMT
*  expire date: Oct 18 11:37:59 2024 GMT
*  issuer: CN=rke2-server-ca@1697715318
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55e9935ee300)
> GET /v1-rke2/serving-kubelet.crt HTTP/2
> Host: 192.168.0.48:9345
> user-agent: curl/7.68.0
> accept: */*
> rke2-node-name: <hostname>
> rke2-node-password: <password>
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
and it hangs, although the /readyz endpoint is ok
unrelated node, but on the server side I see: Nov 01 07:55:05 <controlplane> rke2[3525]: I1101 07:55:05.665245 3525 request.go:690] Waited for 10h11m15.660255046s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:6443/api/v1/namespaces/kube-system/secrets/<hostname>.node-password.rke2
10h throttle 😄
would be nice to know how to tune the client kube-api qps here
this is clearly a concurrency bug as after restarting the rke2-server it starts to reply immediately
@creamy-pencil-82913 wdyt?
I have 320 nodes in the cluster and tried a big-bang restart. This is v1.25.14+rke2r1 and Rancher 2.7.6.
Another occasion: this time, if I direct the request to the apiserver node directly, it works, but it hangs through the local load balancer:
Copy code
curl -H "rke2-Node-Name: <hostname>" -H "rke2-Node-Password: <password>" <https://apiserver:9345/v1-rke2/serving-kubelet.crt> -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key -vvv
*   Trying 192.168.0.48:9345...
* TCP_NODELAY set
* Connected to 192.168.0.48 (192.168.0.48) port 9345 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: O=rke2; CN=rke2
*  start date: Oct 19 11:35:18 2023 GMT
*  expire date: Oct 18 11:37:59 2024 GMT
*  issuer: CN=rke2-server-ca@1697715318
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x56372e206300)
> GET /v1-rke2/serving-kubelet.crt HTTP/2
> Host: 192.168.0.48:9345
> user-agent: curl/7.68.0
> accept: */*
> rke2-node-name: <hostname>
> rke2-node-password: <password>
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 200
< content-type: text/plain; charset=utf-8
< content-length: 1506
< date: Wed, 01 Nov 2023 10:46:24 GMT
<
-----BEGIN CERTIFICATE-----
...
Copy code
# curl -H "rke2-Node-Name: <hostname>" -H "rke2-Node-Password: <password>" https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key -vvv
*   Trying 127.0.0.1:6444...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 6444 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: O=rke2; CN=rke2
*  start date: Oct 19 11:35:18 2023 GMT
*  expire date: Oct 18 11:37:59 2024 GMT
*  issuer: CN=rke2-server-ca@1697715318
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55da0c6b7300)
> GET /v1-rke2/serving-kubelet.crt HTTP/2
> Host: 127.0.0.1:6444
> user-agent: curl/7.68.0
> accept: */*
> rke2-node-name: <hostname>
> rke2-node-password: <password>
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
and that's because the IP added to the rke2-agent-load-balancer indeed does not reply, while the one that does work was removed because it wasn't responding during the startup of the rke2-agent
Copy code
msg="Adding server to load balancer rke2-agent-load-balancer: <apiserver 1>:9345"
msg="Adding server to load balancer rke2-agent-load-balancer: <apiserver 3>:9345"
msg="Removing server from load balancer rke2-agent-load-balancer: <apiserver 1>:9345"
msg="Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [<apiserver 3>:9345] [default: <apiserver 1>:9345]"
msg="failed to get CA certs: Get \"<https://127.0.0.1:6444/cacerts>\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
and even after restarting the rke2-agent the same happens: it tries to talk to apiserver 3 and removes the working apiserver 1 from the agent LB
c
The load-balancer list is populated from the list of endpoints for the Kubernetes service in the default namespace:
kubectl get endpoints kubernetes
g
yeah, since I restarted all nodes at once - to test this scenario - I assume there was only one server node active, but new nodes were never added to the list as they came up
Also raised this https://github.com/rancher/rke2/issues/4975 to summarize my findings
c
You’d need to check that endpoint list to confirm that it’s doing the right thing. I suspect that some of the apiservers are getting overloaded with the rush of clients, which causes the issues you’re seeing - the excessive client-side throttling, and removal of servers from the endpoint list. You might check to see if some of the apiserver static pods are crashing and restarting?
the load-balancer switches servers whenever a backend cannot be dialed successfully; it doesn't know anything about the state of requests going through the connection. If it is able to connect, but the resulting requests hang, it will never switch backends.
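One way to see that distinction from a stuck worker (same endpoints and cert paths as the curls above; the first call only proves the backend is dialable, the second caps a real supervisor request so a hang shows up as a timeout):
Copy code
# dial-level check: if this connects at all, the load balancer keeps the backend
curl -k --connect-timeout 5 https://192.168.0.50:9345/v1-rke2/readyz \
  --cert /var/lib/rancher/rke2/agent/client-kubelet.crt \
  --key /var/lib/rancher/rke2/agent/client-kubelet.key
# request-level check: the dial can succeed while the request itself stalls, so cap it
curl -k --max-time 30 \
  -H "rke2-Node-Name: <hostname>" -H "rke2-Node-Password: <password>" \
  --cert /var/lib/rancher/rke2/agent/client-kubelet.crt \
  --key /var/lib/rancher/rke2/agent/client-kubelet.key \
  https://192.168.0.50:9345/v1-rke2/serving-kubelet.crt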
g
for the latter, I get it, although in the end it's misleading for the agent as it can't perform the action it wants and retries the same apiserver again and again without luck
let me do another round of restart and show the endpoint status
the apiserver pods btw are stable, I have in total 36 cores for the apiservers (+controller managers), so I would assume that should be enough 🙂
Now containerd didn't start on the etcd nodes.
Copy code
Nov 01 17:36:37 production-island10-overcloud-etcd-2 rke2[708]: time="2023-11-01T17:36:37Z" level=info msg="Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory\""
even after restarting the rke2-server on this etcd node, I don't see containerd coming up
etcd-1: no, etcd-2: no, etcd-3: yes, etcd-4: no, etcd-5: yes - so I don't have quorum
tried rebooting etcd-2, no luck
if only I knew why containerd can't come up...
I wonder if it'd be better to run my own containerd instead of the embedded one
cluster is bricked 😕
lol, it took a looong time, but at least I have quorum:
Copy code
sh-4.4# etcdctl --cacert="/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt" --cert="/var/lib/rancher/rke2/server/tls/etcd/server-client.crt" --key="/var/lib/rancher/rke2/server/tls/etcd/server-client.key" --endpoints="https://192.168.0.16:2379,https://192.168.0.17:2379,https://192.168.0.18:2379https://192.168.0.19:2379,https://192.168.0.20:2379" endpoint status --write-out table
{"level":"warn","ts":"2023-11-01T18:20:04.314357Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc0003c8380/192.168.0.16:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp: address 192.168.0.18:2379https:: too many colons in address\""}
Failed to get the status of endpoint <https://192.168.0.18:2379><https://192.168.0.19:2379> (context deadline exceeded)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.0.16:2379 | 90854c1b43df1af2 |   3.5.9 |   82 MB |     false |      false |         6 |   54149274 |           54149274 |        |
| https://192.168.0.17:2379 | 6336293dad683176 |   3.5.9 |   82 MB |     false |      false |         6 |   54149274 |           54149274 |        |
| https://192.168.0.20:2379 |  6df555c9f60573b |   3.5.9 |  394 MB |      true |      false |         6 |   54149345 |           54149345 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
so endpoints:
Copy code
kubectl get endpoints kubernetes
NAME         ENDPOINTS                                               AGE
kubernetes   192.168.0.48:6443,192.168.0.49:6443,192.168.0.50:6443   13d
c
you might check the containerd logs on that node if it ever fails to come up again - that usually indicates that containerd is stalled checking something. I’ve never seen it take more than a few seconds though.
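If it does happen again, a quick place to look (a sketch, assuming the default rke2 data dir):
Copy code
# the embedded containerd writes its own log file under the agent dir
tail -n 100 /var/lib/rancher/rke2/agent/containerd/containerd.log
# and the supervisor logs show why the cri socket isn't coming up
journalctl -u rke2-server -n 200 | grep -i containerd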
g
yeah me neither, it was strange
so, back to the endpoints issue: now controlplane-3 churns while the other two chill:
Copy code
Nov 01 18:30:23 production-island10-overcloud-controlplane-3 rke2[5989]: time="2023-11-01T18:30:23Z" level=info msg="certificate CN=production-island10-overcloud-worker-95 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 18:30:23 +0000 UTC"
Nov 01 18:30:24 production-island10-overcloud-controlplane-3 rke2[5989]: time="2023-11-01T18:30:24Z" level=info msg="certificate CN=production-island10-overcloud-worker-190 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 18:30:24 +0000 UTC"
Nov 01 18:30:24 production-island10-overcloud-controlplane-3 rke2[5989]: time="2023-11-01T18:30:24Z" level=info msg="certificate CN=production-island10-overcloud-worker-76 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 18:30:24 +0000 UTC"
this goes constantly for all 281 NotReady nodes I have now
logs from a NotReady node:
Copy code
Nov 01 17:30:14 production-island10-overcloud-worker-99 rke2[927]: time="2023-11-01T17:30:14Z" level=info msg="Adding server to load balancer rke2-agent-load-balancer: 192.168.0.48:9345"
Nov 01 17:30:14 production-island10-overcloud-worker-99 rke2[927]: time="2023-11-01T17:30:14Z" level=info msg="Adding server to load balancer rke2-agent-load-balancer: 192.168.0.50:9345"
Nov 01 17:30:14 production-island10-overcloud-worker-99 rke2[927]: time="2023-11-01T17:30:14Z" level=info msg="Removing server from load balancer rke2-agent-load-balancer: 192.168.0.48:9345"
Nov 01 17:30:14 production-island10-overcloud-worker-99 rke2[927]: time="2023-11-01T17:30:14Z" level=info msg="Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [192.168.0.50:9345] [default: 192.168.0.48:9345]"
Copy code
192.168.0.50 is production-island10-overcloud-controlplane-3
the k8s service endpoints are stable, no changes there, all three are added
I restart the rke2-agent on production-island10-overcloud-worker-99:
Copy code
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=info msg="Starting rke2 agent v1.25.14+rke2r1 (36d7417e024e8dad34ebbf94b210ab3dd0f52cd7)"
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=info msg="Adding server to load balancer rke2-agent-load-balancer: 192.168.0.48:9345"
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=info msg="Adding server to load balancer rke2-agent-load-balancer: 192.168.0.50:9345"
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=info msg="Removing server from load balancer rke2-agent-load-balancer: 192.168.0.48:9345"
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=info msg="Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [192.168.0.50:9345] [default: 192.168.0.48:9345]"
Nov 01 18:37:59 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:37:59Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Nov 01 18:38:00 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:38:00Z" level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: 192.168.0.48:6443"
Nov 01 18:38:00 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:38:00Z" level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: 192.168.0.50:6443"
Nov 01 18:38:00 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:38:00Z" level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: 192.168.0.48:6443"
Nov 01 18:38:00 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:38:00Z" level=info msg="Running load balancer rke2-api-server-agent-load-balancer 127.0.0.1:6443 -> [192.168.0.50:6443] [default: 192.168.0.48:6443]"
Nov 01 18:38:11 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:38:11Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"<https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt>\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
the same as I said - it still goes to the same controlplane node as before
Copy code
kube-apiserver-production-island10-overcloud-controlplane-1             1/1     Running       24 (39m ago)    8d
kube-apiserver-production-island10-overcloud-controlplane-2             1/1     Running       24 (40m ago)    8d
kube-apiserver-production-island10-overcloud-controlplane-3             1/1     Running       17 (39m ago)    8d
apiservers are up for ~40m
if now I restart rke2-server on production-island10-overcloud-controlplane-3, the other two apiservers will start to receive requests
of course, after the restart, the clients just started to bombard the next server:
Copy code
192.168.0.48 is production-island10-overcloud-controlplane-1
but production-island10-overcloud-worker-99 is still timing out nevertheless:
Copy code
Nov 01 18:43:33 production-island10-overcloud-worker-99 rke2[1163]: time="2023-11-01T18:43:33Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"<https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt>\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
but:
Copy code
root@production-island10-overcloud-worker-99:~# curl -H "rke2-Node-Name: production-island10-overcloud-worker-99" -H "rke2-Node-Password: <password>" https://127.0.0.1:6444/v1-rke2/readyz -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key
ok
Copy code
root@production-island10-overcloud-worker-99:~# cat /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json
{
  "ServerURL": "<https://192.168.0.48:9345>",
  "ServerAddresses": [
    "192.168.0.50:9345"
  ],
  "Listener": null
}
c
Why is 192.168.0.50 the only one that it has in the json cache file? 192.168.0.48 is the default (from the --server uri) but I am curious why it only has that one address in the list.
Were the other two unavailable at the time you shut this node down? If you look at the logs from the agent prior to the restart, do you see it removing the other two addresses from the list?
g
yes, as I said, it was a big-bang restart
and even after I restarted rke2-agent on worker-99 the list remained the same
c
well yeah it’s not going to be able to update the list until it can talk to the apiserver, and apparently the apiserver on that node that it wants to connect to is overloaded by incoming requests
g
Copy code
root@production-island10-overcloud-worker-99:~# curl -H "rke2-Node-Name: production-island10-overcloud-worker-99" -H "rke2-Node-Password: <password>" https://192.168.0.48:9345/v1-rke2/serving-kubelet.crt -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key
^C

root@production-island10-overcloud-worker-99:~# curl -H "rke2-Node-Name: production-island10-overcloud-worker-99" -H "rke2-Node-Password: <password>" https://192.168.0.49:9345/v1-rke2/serving-kubelet.crt -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key
-----BEGIN CERTIFICATE-----
...
root@production-island10-overcloud-worker-99:~# curl -H "rke2-Node-Name: production-island10-overcloud-worker-99" -H "rke2-Node-Password: <password>" https://192.168.0.50:9345/v1-rke2/serving-kubelet.crt -k --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key
-----BEGIN CERTIFICATE-----
...
so, now that clients have swarmed .48, it can't reply to these requests, but the other two would be happy to answer
the agent loadbalancer could WATCH the endpoints and update as new endpoints appear
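That list is visible (and watchable) with plain kubectl, e.g.:
Copy code
# stream changes to the kubernetes service endpoints - the list the agent LB is built from
kubectl -n default get endpoints kubernetes --watch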
c
that is exactly what it does
but it can’t do that until it can talk to the apiserver
and you’ve overloaded the apiserver with 300+ clients all starting up at once
Generally, when building a cluster, we recommend adding nodes in smaller batches; I would recommend doing the same when coming up from a cold start. If you try to bring them all up at once you're going to DDoS things.
g
https://192.168.0.48:9345/v1-rke2/readyz says "ok"
apiserver is not overloaded; it's the rke2-server that is eating the CPU
looks like it constantly regenerates the kubelet certs
c
All that readyz endpoint indicates is that the apiserver came up and rke2 was able to initialize its core controllers.
g
such as:
Copy code
Nov 01 18:55:23 production-island10-overcloud-controlplane-1 rke2[6447]: time="2023-11-01T18:55:23Z" level=info msg="certificate CN=production-island10-overcloud-worker-99 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 18:55:23 +0000 UTC"
Nov 01 18:56:14 production-island10-overcloud-controlplane-1 rke2[6447]: time="2023-11-01T18:56:14Z" level=info msg="certificate CN=production-island10-overcloud-worker-99 signed by CN=rke2-server-ca@1697715318: notBefore=2023-10-19 11:35:18 +0000 UTC notAfter=2024-10-31 18:56:14 +0000 UTC"
apiserver eats 0.3 cores, while rke2-server eats 11 cores now
c
all that log shows is that the agent on that node requested a cert two times. the agent will request certs every time it starts; if it keeps requesting them over and over again it indicates that the agent is crashing or timing out and restarting
g
my main concern - the reason I'm testing this - is the non-zero chance that at some point the whole cluster goes down and is delayed coming back up due to inefficiencies here and there
c
right, and my suggestion was that, if that happens, you don’t try to bring up all the agents at the same time. Do them in smaller batches, say 25 or so.
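A rough way to do that from a jump host (a sketch, assuming a nodes.txt with one worker hostname per line and ssh access; batch size and pause are arbitrary):
Copy code
# restart agents 25 at a time, pausing between batches so the servers can drain the backlog
split -l 25 nodes.txt batch_
for b in batch_*; do
  while read -r node; do
    ssh "$node" 'systemctl restart rke2-agent' &
  done < "$b"
  wait
  sleep 120
done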
g
I believe the problem is that the agent constantly tries only one apiserver endpoint - despite all 3 being there and working
if it rotated the endpoints (and potentially backed off) it'd get through in no time
c
it tries only 1 because there was only 1 available when you shut it down. It would update its list to find the other endpoints, but it can’t because that one server (that it would get additional endpoints from) is overloaded.
g
let's see 🙂
Copy code
root@production-island10-overcloud-controlplane-1:~# systemctl restart rke2-server
once back up, this control-plane will not be swarmed anymore
c
right so the sequence is: get new certs from existing server -> connect to apiserver -> update endpoints
if it can connect to the existing server, but the cert request times out (because that server is overloaded) it will not move past that
if you stop the service on the existing server, then it will fail to connect, and try the other server (the default one, from the logs)
the underlying cause here is that your shutdown sequence left all the agents with only a single server in their cache, and you’ve started them all up at the same time, and overloaded it.
while I agree that we could potentially handle this better, the reality is that starting up hundreds of agents at the same time is never going to work super well. Even when doing our scale testing we are careful to create or restart agents in batches of 25-50.
if you’re using rancher or the system-upgrade-controller, it is done even more slowly, in single-digit batches
g
yeah, this is what happened, but this is a hell of an inefficient method unfortunately. Wouldn't it be possible to cache the endpoints for these use-cases? i.e. use the cached version of the client certs (they're already there!) and the cached version of /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json, and update once the connection comes back?
c
using the old certs 1. wouldn’t work if they’re expired 2. would require us to restart everything again after the certs have been updated, as the core components don’t support live reloading of certificates
g
at least round-robin between the ServerURL and whatever is in ServerAddresses if there's a non-nil err
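Roughly what that proposal would look like, done by hand from a stuck worker (server IPs and cert paths as in the curls above; this only illustrates the idea, it is not how the agent behaves today):
Copy code
# try every known server in turn with a capped timeout, back off briefly between attempts
for srv in 192.168.0.48 192.168.0.49 192.168.0.50; do
  curl -sk --max-time 15 \
    -H "rke2-Node-Name: <hostname>" -H "rke2-Node-Password: <password>" \
    --cert /var/lib/rancher/rke2/agent/client-kubelet.crt \
    --key /var/lib/rancher/rke2/agent/client-kubelet.key \
    "https://${srv}:9345/v1-rke2/serving-kubelet.crt" && break
  sleep 5
done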
c
that would be a lot more work to handle. it is a very simple l3 load balancer, as long as the endpoint can be connected to then it is used. as I said above, only if the dial fails does it remove the server from the list and start using a different one.
g
using the old certs, yeah, I get it, but which use-case is more likely and/or more problematic? 🙂 not being able to start the server at all, or starting the server with an expired cert?
is it possible for me to disable the agent LB and use something else?
c
no
is it not possible for you to avoid starting all the agents at the exact same time?
g
it always is, but I run 20k customer websites per island, and if the shit hits the fan all of them will start at the same time 🙂
I thought of adding a sleep on the nodes before starting up the rke2-agent
don't laugh 😄
Copy code
# /etc/systemd/system/rke2-agent.service.d/10-override.conf
[Service]
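# stagger only worker nodes by a random 0-59 seconds so a cold start doesn't hit the servers all at once;
# non-worker hostnames fall through to /bin/true and start immediately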
ExecStartPre=/bin/bash -xc '/bin/hostname | grep -q worker && /bin/sleep $((RANDOM % 60)) || /bin/true'
but meh it hurts my eyes 😕
c
yeah possibly a good workaround for now!
We can leave that issue open to track figuring out a way to handle this better - but at the moment just staggering the startups is probably the best way to address it
g
my idea is to check on the worker nodes whether the existing client certificate is expired; if it isn't, try to use it to update the endpoints; only if it's a must - i.e. the cert is expired or missing - update the client certificate first, then go to the apiserver and then update the endpoints
this would be much much cheaper computationally
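Checking whether the cached cert is still usable is cheap; for example (same path as the curls above):
Copy code
# print the expiry date of the cached kubelet client cert
openssl x509 -noout -enddate -in /var/lib/rancher/rke2/agent/client-kubelet.crt
# exit 0 while it is still valid, non-zero once it has expired
openssl x509 -noout -checkend 0 -in /var/lib/rancher/rke2/agent/client-kubelet.crt \
  && echo "still valid" || echo "expired"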
c
yeah I’m not sure where it’s getting overloaded; I wouldn’t expect generation of certs + keys to be that hard. If you can get a stack trace from the rke2 process when it’s chewing up 12 cores, I’d be interested to see what it’s doing.
g
I have a flamegraph
c
that’d work too
I was just going to suggest kill -ABRT and then grabbing the dump from the logs, but a flame graph is better
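For reference, roughly what that looks like (assuming a single rke2 process on the node; the Go runtime dumps all goroutine stacks to the journal on SIGABRT, and the process exits):
Copy code
# send SIGABRT to the supervisor and pull the goroutine dump out of the journal
kill -ABRT "$(pidof rke2)"
journalctl -u rke2-server -n 5000 > rke2-goroutines.txt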
do you have the full thing? not sure what the scale is here.
this bit here just shows that it’s verifying the agent token, and spending a bunch of time in the constant-time password comparison
g
yeah, and I'm afraid that the request.go client is throttled, as I wrote in my issue, and that's the reason why it can't finish
c
have you considered using bootstrap tokens instead of classic password-style tokens?
those are handled differently, I’m curious if you’d see any different behavior
g
hm, I'm using this:
Copy code
${rancher2_cluster_v2.default.cluster_registration_token.0.node_command}
in TF
c
yeah that’s the default
just using the token
g
so, a default bootstrap password and use that?
c
if you’re using the rancher tf provider for provisioning I don’t think you can change it, it always just uses the server token
g
kk, checking
btw, I restarted the rke2-agent on one NotReady node; the ServerURL is not doing anything - it still tried the swarmed server and didn't update the endpoints:
Copy code
Nov 01 19:34:27 production-island10-overcloud-worker-96 rke2[1186]: time="2023-11-01T19:34:27Z" level=info msg="Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [192.168.0.50:9345] [default: 192.168.0.48:9345]"
c
it won’t update it until it is able to talk to the apiserver. I can’t remember if it logs it later or not, you might need to check the contents of the cache file, or turn the log level up to debug, if you want to see that later.
g
yeah, I checked the cache and it's still the same
only tries to talk to the server in ServerAddresses and never tries ServerURL instead
c
as long as it can be connected to it won’t switch
you’d need to actually stop the rke2 service on the node it’s trying to connect to, to get it to fail over.
g
yeah, I did exactly that
c
are you looking at the apiserver or the supervisor lb?
g
this one for now:
Copy code
# cat /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json
{
  "ServerURL": "<https://192.168.0.48:9345>",
  "ServerAddresses": [
    "192.168.0.50:9345"
  ],
  "Listener": null
Copy code
# cat /var/lib/rancher/rke2/agent/etc/rke2-api-server-agent-load-balancer.json
{
  "ServerURL": "<https://192.168.0.48:6443>",
  "ServerAddresses": [
    "192.168.0.50:6443"
  ],
  "Listener": null
^^ for the apiserver LB
152 NotReady nodes left :D
c
for the apiserver, just stopping rke2 won't do it, you'd have to kill the apiserver static pod as well. rke2-killall.sh is probably the easiest way to do that.
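Something like (a sketch; the killall script path can vary by install method):
Copy code
# stop the supervisor, then tear down the static pods (including kube-apiserver)
systemctl stop rke2-server
/usr/local/bin/rke2-killall.sh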
g
I saw that with rke2-server restart they started to swarm the first apiserver tho'
c
for the supervisor yeah, but the apiserver lb is separate - at least with regards to forcing it to fail over
g
so there's movement: if I keep killing the "active" apiservers once they're overloaded, the agents move around and eventually all nodes will be Ready
thx for all the help, it's getting late here - really appreciate it!