# rke2
b
Is there any way to automatically schedule etcd to defrag the database? I would have thought that it would happen automatically, but that doesn't seem to be the case.
c
Defrag is disruptive (it stops all IO) and should not be done while the database is in use. RKE2 defrags the database every time the service is restarted. If you have something alerting on etcd datastore fragmentation being over some arbitrary threshold, I would probably just ignore/suppress it. There are monitoring packages out there with terrible default thresholds. It is normal for there to be free pages within the allocated space: as etcd hits a steady state of things being created and deleted, the unused pages allocated from disk will turn over a bit. If you're constantly defragging for no good reason, etcd will just have to go reallocate that space from disk again anyway.
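(For what it's worth, if you do want to trigger that restart-time defrag deliberately, a minimal sketch, assuming the standard rke2-server systemd unit and doing one server node at a time so quorum is kept:)
# Restarting the RKE2 server service defrags the local etcd database on startup,
# per the behavior described above. Do one server node at a time.
sudo systemctl restart rke2-server.service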
b
Hm, yeah, the default Prometheus settings for RKE2/Elemental start griping when the fragmentation ratio is over 50%. Typically I see it trigger every other day or so; I run a defrag once and it's fine for another day or two. But today I saw one cluster (Harvester) go from 1.1G to ~200 MB.
Uh, correction: 1.1 GiB to 81 MiB.
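(For context, that kind of alert is typically derived from etcd's own total-size vs. in-use gauges. A quick way to eyeball them on a server node, as a sketch: this assumes etcd is serving /metrics on its client port, which it does by default, and reuses the RKE2 client cert paths shown further down in this thread.)
# Dump the two gauges the fragmentation alert is usually computed from:
# total allocated database size vs. size actually in use.
curl -s \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  https://127.0.0.1:2379/metrics | grep -E '^etcd_mvcc_db_total_size(_in_use)?_in_bytes'
# Fragmentation ratio is roughly 1 - (in_use / total); the alert fires past ~50%.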
c
What is that figure? Unused space?
b
Let me grab the output
c
Or actual db size?
b
db size
Switched to context "compute".
Getting compute etcd Status
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|           ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://128.111.126.103:2379 | 20a7fa08b68045e4 |  3.5.16 |  267 MB |     false |      false |        30 |  997461295 |          997461295 |        |
| https://128.111.126.108:2379 | 57fcd8cd8bf11a0e |  3.5.16 |  1.1 GB |      true |      false |        30 |  997461295 |          997461295 |        |
| https://128.111.126.102:2379 | ee8de2b884379670 |  3.5.16 |  230 MB |     false |      false |        30 |  997461295 |          997461295 |        |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Defragging the etcd in the compute cluster via etcd-kube10
Finished defragmenting etcd member[https://128.111.126.103:2379]
Finished defragmenting etcd member[https://128.111.126.108:2379]
Finished defragmenting etcd member[https://128.111.126.102:2379]
Getting compute etcd Health
+------------------------------+--------+------------+-------+
|           ENDPOINT           | HEALTH |    TOOK    | ERROR |
+------------------------------+--------+------------+-------+
| https://128.111.126.102:2379 |   true | 5.251518ms |       |
| https://128.111.126.108:2379 |   true | 6.502341ms |       |
| https://128.111.126.103:2379 |   true | 7.157662ms |       |
+------------------------------+--------+------------+-------+
Getting compute etcd Status
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|           ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://128.111.126.103:2379 | 20a7fa08b68045e4 |  3.5.16 |   81 MB |     false |      false |        30 |  997461460 |          997461460 |        |
| https://128.111.126.108:2379 | 57fcd8cd8bf11a0e |  3.5.16 |   81 MB |      true |      false |        30 |  997461460 |          997461460 |        |
| https://128.111.126.102:2379 | ee8de2b884379670 |  3.5.16 |   81 MB |     false |      false |        30 |  997461461 |          997461461 |        |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Essentially running:
kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint status --cluster -w table
and
kubectl -n kube-system exec -it ${etcdnode} -- etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt defrag --cluster
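(Since the original question was about automating this: the two commands above could be gated on an actual fragmentation check so a defrag only runs when there is something to reclaim. A rough sketch follows; the jq expression, the 0.5 threshold, and the reliance on the dbSize/dbSizeInUse fields in etcdctl's JSON status output are assumptions here, not anything RKE2 ships, and -it is dropped since it isn't interactive. Per the discussion below, though, it's probably fine not to bother at all.)
# Compute the worst fragmentation across members: 1 - (dbSizeInUse / dbSize).
frag=$(kubectl -n kube-system exec ${etcdnode} -- etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint status --cluster -w json \
  | jq '[.[] | 1 - (.Status.dbSizeInUse / .Status.dbSize)] | max')

# Only defrag when more than half of the allocated space is free pages.
if awk -v f="$frag" 'BEGIN { exit !(f > 0.5) }'; then
  kubectl -n kube-system exec ${etcdnode} -- etcdctl \
    --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
    --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
    --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
    defrag --cluster
fi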
c
It's just directly proportional to how much you're storing in there. etcd will not return space to the OS by default, so if a bunch of events or other temporary resources cause it to need to store 1 GB of data, 1 GB is where the file will stay, even after those resources are deleted and the pages are freed.
Hopefully 1 GB of disk space isn't going to break the camel's back; that's nothing these days. I would just leave it alone. It's not a problem and it's not hurting anything.
b
The way the warnings were worded, it seemed like there was a bunch of stale data it was hanging onto that could cause issues, and that the defrag only keeps what's active/relevant.
c
No, it's not a problem. The pages are just there for when etcd needs them, without it having to grow the file on disk.
The alert as a whole is garbage as far as I'm concerned.
b
👍 thx
It's good to know. 🙂