# longhorn-storage
a
Not sure why this happened, @crooked-cat-21365. Could you please provide the support bundle? Thanks
c
Following the guideline at https://www.suse.com/support/kb/doc/?id=000020145 I came to step 4. There was no index.html button and especially no "Generate Support Bundle" feature to be found.
Found it; it is not in Rancher but in the Longhorn GUI.
👍 1
The support bundle is about 10 MByte. Too much to post here, but I could send an e-mail. What would you suggest?
a
Thanks, you could send it to this mailbox: longhorn-support-bundle@Suse.com
c
On its way. Thank you very much for your help.
🙂 1
👍 1
It is just 3 MByte compressed.
a
Hi, could you please provide some details about your cluster: how many worker nodes, how much RAM, which OS, VM or bare metal, and when the OOM happened, what were you doing on the cluster?
c
Bare metal, 3 identical hosts, each with 24 cores + HT, 512 GByte RAM, no swap. Debian 11, cgroup/cgroupv2 (booted with systemd.unified_cgroup_hierarchy=0). All hosts installed via Rancher v2.7.1, K3s 1.24.10+k3s1. All nodes are both control plane and worker nodes. This is a development cluster used to run CI/CD pipelines (GitLab runners): Java 8 and 17, Maven, Docker-in-Docker. They also run software previews for testing; that's where Longhorn comes into the game. Some of these build pipelines can take an awful amount of memory, because the JDK isn't that good at respecting memory restrictions. It is possible that some very old JDK 8 builds without cgroup awareness are involved.
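For what it's worth, whether a given JDK honors cgroup limits can be checked from inside a build pod. A minimal sketch, assuming a JDK 8u191 or newer image; the pod name and namespace are placeholders:
```bash
# Check whether the JVM honors cgroup limits.
# UseContainerSupport exists since JDK 8u191; on older JDK 8 builds the grep
# simply returns nothing, which already answers the question.
kubectl exec -n <runner-namespace> <runner-pod> -- \
  java -XX:+PrintFlagsFinal -version | grep -Ei 'UseContainerSupport|MaxRAMPercentage'
```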
PS: I am not even sure that this is actually a problem. This morning I found
```
{hdunkel@dpcl082:~ 08:50:42 (local) 1002} kubectl oomd -A
NAMESPACE         POD                              CONTAINER                  REQUEST   LIMIT   TERMINATION TIME
longhorn-system   engine-image-ei-87057037-k52s9   engine-image-ei-87057037   0         0       2023-07-24 13:18:10 +0200 CEST
longhorn-system   engine-image-ei-87057037-trxfw   engine-image-ei-87057037   0         0       2023-06-28 12:00:19 +0200 CEST
longhorn-system   engine-image-ei-ef01bf86-gjmzd   engine-image-ei-ef01bf86   0         0       2023-07-24 13:18:08 +0200 CEST
longhorn-system   engine-image-ei-ef01bf86-spq6b   engine-image-ei-ef01bf86   0         0       2023-06-28 12:00:19 +0200 CEST
```
The same containers were killed by oomd again, which implies they were restarted before.
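For reference, the restart history of those pods can be cross-checked directly. A rough sketch; the label selector is an assumption about how Longhorn labels its engine-image DaemonSet pods:
```bash
# Restart counts for the engine-image pods (label selector assumed).
kubectl -n longhorn-system get pods -l longhorn.io/component=engine-image

# Why the previous container instance terminated (OOMKilled vs. something else),
# using one of the pod names from the output above.
kubectl -n longhorn-system describe pod engine-image-ei-87057037-k52s9 | grep -A5 'Last State'
```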
a
Actually the `engine-image-ei` pod only runs a few commands; it should not trigger an OOM.
Have you checked the memory usage of other applications, since you mentioned the old JDKs? It is weird that only the `engine-image-ei-xxx` pods got killed. Could you observe the memory usage of the engine-image pods?
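A minimal way to do that, assuming metrics-server is installed on the cluster:
```bash
# Live memory usage of the Longhorn engine-image pods (requires metrics-server).
kubectl top pod -n longhorn-system | grep engine-image

# The same view broken down per container.
kubectl top pod -n longhorn-system --containers | grep engine-image
```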
c
Maybe the oomd kubectl plugin is buggy, but it reports only the engine-image-ei-something pods. Today:
```
% kubectl oomd -A
NAMESPACE         POD                              CONTAINER                  REQUEST   LIMIT   TERMINATION TIME
longhorn-system   engine-image-ei-87057037-k52s9   engine-image-ei-87057037   0         0       2023-07-24 13:18:10 +0200 CEST
longhorn-system   engine-image-ei-87057037-pfxx8   engine-image-ei-87057037   0         0       2023-07-26 07:35:24 +0200 CEST
longhorn-system   engine-image-ei-87057037-trxfw   engine-image-ei-87057037   0         0       2023-06-28 12:00:19 +0200 CEST
longhorn-system   engine-image-ei-ef01bf86-5wsb8   engine-image-ei-ef01bf86   0         0       2023-07-25 11:10:48 +0200 CEST
longhorn-system   engine-image-ei-ef01bf86-gjmzd   engine-image-ei-ef01bf86   0         0       2023-07-24 13:18:08 +0200 CEST
longhorn-system   engine-image-ei-ef01bf86-spq6b   engine-image-ei-ef01bf86   0         0       2023-06-28 12:00:19 +0200 CEST
```
oomd seems to be an internal Kubernetes thing. There are "real" OOMs listed in kernel.log as well (not related to Longhorn), but kubectl oomd doesn't list those. AFAICT oomd kills pods if resources are getting tight, and pods that don't request any resources are killed first.
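That would match Kubernetes QoS behaviour: pods with no requests or limits run as BestEffort and are the first candidates under memory pressure. A quick way to confirm the QoS class of the affected pods, using a pod name from the output above:
```bash
# QoS class of one of the affected pods; BestEffort pods are reclaimed first
# under node memory pressure.
kubectl -n longhorn-system get pod engine-image-ei-87057037-k52s9 \
  -o jsonpath='{.status.qosClass}{"\n"}'

# Overview of QoS classes for all pods in the namespace.
kubectl -n longhorn-system get pods \
  -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
```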
b
Have you tried the Priority Class setting to see if it helps?
c
No, not yet. Thank you very much for the pointer. I will try.
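For completeness, a sketch of what that could look like: Longhorn exposes a priority-class setting (adjustable in the Longhorn UI), and the referenced PriorityClass object has to exist in the cluster first. The class name and priority value below are examples only:
```bash
# Create a high-priority class for Longhorn system pods (name and value are examples).
kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: longhorn-critical
value: 1000000
globalDefault: false
description: "Keep Longhorn system components from being reclaimed before workloads"
EOF

# Then point Longhorn's priority-class setting at it, either through the Longhorn UI
# settings page or, if managing settings via kubectl is acceptable in your setup,
# by editing the setting resource directly:
kubectl -n longhorn-system edit settings.longhorn.io priority-class
```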