calm-farmer-45530
01/19/2024, 10:10 PM
VERSION           OS-IMAGE             KERNEL-VERSION
v1.26.10+rke2r2   Ubuntu 22.04.3 LTS   6.2.0-1017-azur
I get multiple nodes daily in NotReady state, pods all get stuck Terminating, and the serial console shows CPU soft lockups. Not seeing this issue on AWS clusters, but the workloads are somewhat different.
I've scoured Grafana and basically just have monitoring gaps during these periods, and cannot find any indicators of a particular pod spiking CPU before Prometheus stops reporting in. So right now I'm just rebooting nodes like a moron.