
little-smartphone-40189

09/28/2022, 1:18 PM
We are currently experiencing an issue with a large (360-node) cluster using the AWS CNI driver. We have noticed that this puts excessive strain on the cluster's CP/etcd nodes due to network overload. Also, the Rancher local server randomly times out accepting responses, which ends in "error in remotedialer server 400 / websocket close 1006" errors, and the Rancher pods on the local cluster stop accepting any new node registrations (for any cluster). Can't really figure out how to prevent this.

quick-sandwich-76600

09/28/2022, 4:58 PM
How is your control layer sized (number of CP nodes and etcd nodes, VM sizes, storage type, ...)? Is etcd performing OK? Have you seen errors/warnings/timeouts in the etcd logs?
I can't really help with the AWS CNI driver, but we can have a look at possible bottlenecks.
Have you already opened a support ticket?
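Since etcd is very sensitive to disk sync latency, a rough way to sanity-check the etcd data volume is a small synced-write test. This is only an illustrative sketch (the path and sizes are stand-ins; etcd's docs recommend a proper fio fdatasync benchmark for real numbers):

```shell
# Illustrative only: write 1000 x 512B blocks with O_DSYNC to the volume
# holding the etcd data dir (/tmp here as a stand-in for /var/lib/etcd).
# Slow completion or low MB/s in dd's summary points at the storage layer.
dd if=/dev/zero of=/tmp/etcd-disk-test bs=512 count=1000 oflag=dsync
# Clean up the test file afterwards:
# rm /tmp/etcd-disk-test
```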

little-smartphone-40189

09/28/2022, 6:39 PM
CP nodes are c5n.4xlarge. etcd nodes are m5.2xlarge, with gp2 volumes of 750 GiB (2250 IOPS). This was on Rancher 2.5.8. We are now trying the same setup on 2.6.7.

quick-sandwich-76600

09/29/2022, 10:02 AM
Can you keep an eye on the alarms and the performance-related issues mentioned here: https://docs.ranchermanager.rancher.io/troubleshooting/kubernetes-components/troubleshooting-etcd-nodes? I always try to rule out any possible issue on the etcd side when doing this type of troubleshooting. It's also very important to look for "apply entries took too long" messages in the log, as they're a clear hint that etcd is having issues (https://etcd.io/docs/v3.1/faq/#what-does-the-etcd-warning-apply-entries-took-too-long-mean).
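A minimal sketch of what scanning for that warning might look like. The log excerpt below is a fabricated sample standing in for the real etcd container log (on RKE nodes that would typically come from something like `docker logs etcd`):

```shell
# Fabricated sample lines standing in for real etcd log output:
cat > /tmp/etcd-sample.log <<'EOF'
2022-09-28 13:01:02.123456 W | etcdserver: apply entries took too long [123.456ms for 1 entries]
2022-09-28 13:01:05.654321 I | etcdserver: start to snapshot (applied: 400021, lastsnap: 390020)
EOF

# Any hits mean etcd applied entries slower than the warning threshold --
# usually disk contention, CPU starvation, or a slow peer.
grep -c "apply entries took too long" /tmp/etcd-sample.log   # prints "1"
```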