# rke2
s
not even able to ssh
c
If you can't even ssh to it, I'm not surprised rke2 doesn't work either. Can you identify why it is becoming unresponsive? Set up syslog or something else to get logs off the box for troubleshooting?
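For example, a minimal remote-logging sketch, assuming rsyslog is already on the node (the collector hostname is just a placeholder for your own syslog server):

```bash
# Forward everything to a remote collector so logs survive the node locking up.
# logs.example.internal is a placeholder; point it at your own log server.
cat <<'EOF' | sudo tee /etc/rsyslog.d/90-remote.conf
# @@host:port forwards over TCP; a single @ would use UDP
*.* @@logs.example.internal:514
EOF
sudo systemctl restart rsyslog
```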
s
So basically I have a centralized cluster of 3 control plane and 5 worker nodes. On it I have deployed two downstream clusters: Cluster-dev (14 nodes) and Cluster-qa (17 nodes). But when I started provisioning Cluster-uat (17 nodes), the dev and qa clusters started giving me the above error.
Like you said, I did some troubleshooting and found that the same vCenter datastore is used by all the downstream clusters, and by the time I start provisioning the uat cluster the datastore is at about 3.8 TB used out of 4 TB. Could that be what is causing the resource crunch on dev and qa? Because when I removed the uat cluster, the dev and qa nodes slowly went from disk pressure back to the Running state.
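A quick way to confirm which nodes are actually reporting disk pressure (a sketch, assuming kubectl access to the affected downstream cluster):

```bash
# Print each node name alongside the status of its DiskPressure condition
# (True means the kubelet is evicting pods because the disk is too full).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
```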
c
sounds likely
you should also be aware of latency. etcd nodes have strict IO latency requirements. If they’re all on the same datastore, then disk IO from image pulls and such on one node could affect etcd latency on others. Make sure you’re not overloading the datastore - not just with the size, but also the throughput.
💯 1
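One way to see whether etcd is actually suffering from slow disks (a sketch; the cert paths below are the usual RKE2 locations, so double-check them on your install) is to pull etcd's own latency metrics on a server node:

```bash
# etcd_disk_wal_fsync_duration_seconds p99 should stay below ~10ms;
# sustained higher values mean the datastore is too slow or too busy for etcd.
curl -s \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert   /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key    /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  https://127.0.0.1:2379/metrics | grep -E 'wal_fsync_duration_seconds|backend_commit_duration_seconds'
```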
s
Does that mean I should have separate datastores per environment? And how can I be sure a datastore doesn't get overloaded on throughput?
c
managing shared infrastructure is kinda below the level of RKE2, I don’t know your environment or infrastructure well enough to make any recommendations. In general, shared infrastructure needs to be sufficient to meet the peak load requirements of everything you put on it.
You wouldn’t want your dev environment to take down prod because a developer decided to load-test something, and that took all the CPU or disk IO away from your prod cluster.
to be honest, it kinda sounds like you don’t actually have enough infrastructure to do everything you want to do.
s
Yeah, the datastores provided to me are split like this: one set for the centralized clusters (prod and non-prod setup), then 1 datastore shared by dev, qa and uat, and 1 for downstream prod. You're right, it's better to have segregation than shared disk IOPS that lead to inconsistent cluster nodes.
What's the default IOPS requirement for etcd on Rancher? Is the article below the correct requirement?
Fast disks are the most critical factor for etcd deployment performance and stability.
A slow disk will increase etcd request latency and potentially hurt cluster stability. Since etcd's consensus protocol depends on persistently storing metadata to a log, a majority of etcd cluster members must write every request down to disk. Additionally, etcd will also incrementally checkpoint its state to disk so it can truncate this log. If these writes take too long, heartbeats may time out and trigger an election, undermining the stability of the cluster.
etcd is very sensitive to disk write latency. Typically 50 sequential IOPS (e.g., a 7200 RPM disk) is required. For heavily loaded clusters, 500 sequential IOPS (e.g., a typical local SSD or a high performance virtualized block device) is recommended. Note that most cloud providers publish concurrent IOPS rather than sequential IOPS; the published concurrent IOPS can be 10x greater than the sequential IOPS. To measure actual sequential IOPS, we suggest using a disk benchmarking tool such as diskbench or fio.
etcd requires only modest disk bandwidth but more disk bandwidth buys faster recovery times when a failed member has to catch up with the cluster. Typically 10MB/s will recover 100MB data within 15 seconds. For large clusters, 100MB/s or higher is suggested for recovering 1GB data within 15 seconds.
When possible, back etcd's storage with a SSD. A SSD usually provides lower write latencies and with less variance than a spinning disk, thus improving the stability and reliability of etcd. If using spinning disk, get the fastest disks possible (15,000 RPM). Using RAID 0 is also an effective way to increase disk speed, for both spinning disks and SSD. With at least three cluster members, mirroring and/or parity variants of RAID are unnecessary; etcd's consistent replication already gets high availability.
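Those numbers come from the upstream etcd hardware docs, and they apply to RKE2/Rancher too since it runs the same etcd underneath. If you want to measure what the datastore actually delivers, here's a sketch of the commonly used fio test; the directory path is an assumption based on the default RKE2 etcd data dir, adjust it to whatever disk actually backs etcd:

```bash
# Simulates etcd's write pattern: small sequential writes, each followed by fdatasync.
# The test directory must live on the same disk/datastore that backs etcd's data dir
# (/var/lib/rancher/rke2/server/db is the usual RKE2 location; adjust if needed).
mkdir -p /var/lib/rancher/rke2/server/db/fio-test
fio --name=etcd-disk-check --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/rancher/rke2/server/db/fio-test --size=22m --bs=2300
# The fdatasync latency percentiles in the output are what matter:
# the 99th percentile should stay under ~10ms for a healthy etcd disk.
rm -rf /var/lib/rancher/rke2/server/db/fio-test
```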