# general
c
I've made no recent changes. It's been stable on v2.8.0 for a month or so
oh, it seems I have 25k longhorn "backup" entries in my etcd
only 5k of my entries aren't longhorn backups πŸ˜„
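A quick way to confirm that count without touching etcd directly is to list the Backup CRs through the apiserver. A minimal sketch with the official Python kubernetes client, assuming a default Longhorn install (longhorn-system namespace, longhorn.io/v1beta2 API version):

```python
from kubernetes import client, config

# Load credentials from ~/.kube/config (use config.load_incluster_config() in a pod).
config.load_kube_config()

api = client.CustomObjectsApi()

# Longhorn Backup objects are CRDs in the longhorn.io group; the API version
# (v1beta2) and namespace (longhorn-system) are assumptions for a default install.
backups = api.list_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta2",
    namespace="longhorn-system",
    plural="backups",
)
print(f"Longhorn Backup objects: {len(backups['items'])}")
```

With the apiserver already struggling, paging with `limit=` (as in the deletion sketch further down) is gentler than one giant LIST of 25k objects.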
getting "context cancelled", which causes a race condition in the kube-apiserver
having to constantly stop all kube-apiservers so etcd has enough RAM/CPU to breathe, then I have to restart and try to delete some backup entries in like 30 seconds
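One way to make the most of that 30-second window is to script the cleanup: page through the Backup CRs in small chunks and delete them via the apiserver. A rough sketch, again assuming the default longhorn-system namespace and v1beta2 API, and a reasonably recent Python client for the paging kwargs. Note that deleting a Backup CR may also remove the backup from the backup target, not just the record in etcd, so filter to the ones you actually want gone:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

GROUP, VERSION, NS, PLURAL = "longhorn.io", "v1beta2", "longhorn-system", "backups"

cont = None
while True:
    # Small pages keep each LIST cheap for the overloaded apiserver/etcd.
    kwargs = {"limit": 200}
    if cont:
        kwargs["_continue"] = cont
    page = api.list_namespaced_custom_object(GROUP, VERSION, NS, PLURAL, **kwargs)

    for item in page["items"]:
        name = item["metadata"]["name"]
        # Add your own filter here (e.g. on metadata.creationTimestamp).
        # Caution: deleting a Backup CR may also delete the backup data
        # from the backup target, not only the etcd entry.
        api.delete_namespaced_custom_object(
            GROUP, VERSION, NS, PLURAL, name, body=client.V1DeleteOptions()
        )
        print("deleted", name)

    cont = page["metadata"].get("continue")
    if not cont:
        break
```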
m
Why do you have swap enabled? Doom sets in once swapping starts; we're supposed to rely on the OOM killer, as bad as that sounds.
πŸ‘€ 1
c
I found out it's because 85% of my etcd entries were all of one resource type, and apparently kube-apiserver just kills itself after one "took too long".
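Checking which resource type dominates doesn't require walking etcd keys: kube-apiserver reports per-resource object counts on its /metrics endpoint (apiserver_storage_objects on recent releases, etcd_object_counts on older ones). A sketch that pulls those lines through the Python client's raw call; whether this exact call works depends on the client version, and `kubectl get --raw /metrics` returns the same data:

```python
from kubernetes import client, config

config.load_kube_config()
api_client = client.ApiClient()

# Fetch the apiserver's Prometheus metrics page as plain text.
metrics, _, _ = api_client.call_api(
    "/metrics", "GET",
    auth_settings=["BearerToken"],
    response_type="str",
    _preload_content=True,
)

# Print the per-resource object-count lines to see what is filling etcd.
for line in metrics.splitlines():
    if line.startswith(("apiserver_storage_objects", "etcd_object_counts")):
        print(line)
```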
m
Ah yes, the Timeout of Doom
c
is that a known thing? is there a fix? haha
m
Food for thought and grounds for further research as the radio hosts are known to say during the call-in segment
πŸ˜„ 1
c
would even adding more etcd nodes prevent this?
m
maybe increase the toxicity of this same scenario so it happens faster, and then run it on a different version of k8s, or several different versions
to learn more about it
then open a support ticket with the boiled-down problem
πŸ‘ 1
c
guess I gotta make sure every resource retention policy exists/works at all πŸ˜„
😁 1
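On the retention side, Longhorn's recurring backup jobs carry a retain count that prunes old backups automatically. A sketch that creates one through the RecurringJob CRD; the schedule, retain count, group, namespace, and API version are all assumptions to adjust for the real install:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# A daily backup job that keeps only the newest 7 backups per volume, so
# Backup objects can't pile up into the tens of thousands again. Field names
# follow the Longhorn RecurringJob CRD; values here are placeholders.
recurring_job = {
    "apiVersion": "longhorn.io/v1beta2",
    "kind": "RecurringJob",
    "metadata": {"name": "backup-daily", "namespace": "longhorn-system"},
    "spec": {
        "name": "backup-daily",
        "task": "backup",
        "cron": "0 3 * * *",   # every day at 03:00
        "retain": 7,           # prune everything beyond the newest 7
        "concurrency": 2,
        "groups": ["default"],
        "labels": {},
    },
}

api.create_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta2",
    namespace="longhorn-system",
    plural="recurringjobs",
    body=recurring_job,
)
```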