# longhorn-storage
a
Not sure why this happened, @crooked-cat-21365. Could you please provide the support bundle? Thanks
c
Following the guideline at https://www.suse.com/support/kb/doc/?id=000020145 I came to step 4. There was no index.html button and especially no "Generate Support Bundle" feature to be found.
Found it; it is not in Rancher but in the Longhorn GUI.
👍 1
The support bundle is about 10 MByte. Too much to post here, but I could send an e-mail. What would you suggest?
a
Thanks, you could send it to this mailbox: longhorn-support-bundle@Suse.com
c
On its way. Thank you very much for your help.
🙂 1
👍 1
It is just 3 MByte compressed.
a
Hi, could you please provide some details about your cluster: how many worker nodes, how much RAM, which OS, VM or bare metal, and when the OOM happened, what were you doing on the cluster?
c
Bare metal, 3 identical hosts, each with 24 cores + HT, 512 GByte RAM, no swap. Debian 11, cgroup/cgroupv2 (booted with systemd.unified_cgroup_hierarchy=0). All hosts installed via Rancher v2.7.1, K3s 1.24.10+k3s1. All nodes are both control plane and worker nodes. This is a development cluster used to run CI/CD pipelines (GitLab runners): Java 8 and 17, Maven, Docker-in-Docker. They also run software previews for testing; that's where Longhorn comes into the game. Some of these build pipelines can take an awful amount of memory, because the JDK isn't that good at respecting memory restrictions. It is possible that some very old JDK 8 builds without cgroup awareness are involved.
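For what it's worth, whether a given JDK honors cgroup limits can be checked from inside a build pod. A minimal sketch, assuming a JDK 8u191 or newer image; the pod name and namespace are placeholders:
```bash
# Check whether the JVM honors cgroup limits.
# UseContainerSupport exists since JDK 8u191; on older JDK 8 builds the grep
# simply returns nothing, which already answers the question.
kubectl exec -n <runner-namespace> <runner-pod> -- \
  java -XX:+PrintFlagsFinal -version | grep -Ei 'UseContainerSupport|MaxRAMPercentage'
```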
PS: I am not even sure that this is actually a problem. This morning I found
```
{hdunkel@dpcl082:~ 08:50:42 (local) 1002} kubectl oomd -A
NAMESPACE         POD                              CONTAINER                  REQUEST   LIMIT   TERMINATION TIME
longhorn-system   engine-image-ei-87057037-k52s9   engine-image-ei-87057037   0         0       2023-07-24 13:18:10 +0200 CEST
longhorn-system   engine-image-ei-87057037-trxfw   engine-image-ei-87057037   0         0       2023-06-28 12:00:19 +0200 CEST
longhorn-system   engine-image-ei-ef01bf86-gjmzd   engine-image-ei-ef01bf86   0         0       2023-07-24 13:18:08 +0200 CEST
longhorn-system   engine-image-ei-ef01bf86-spq6b   engine-image-ei-ef01bf86   0         0       2023-06-28 12:00:19 +0200 CEST
```
The same containers were killed by oomd again, which implies they were restarted before.
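For reference, the restart history of those pods can be cross-checked directly. A rough sketch; the label selector is an assumption about how Longhorn labels its engine-image DaemonSet pods:
```bash
# Restart counts for the engine-image pods (label selector assumed).
kubectl -n longhorn-system get pods -l longhorn.io/component=engine-image

# Why the previous container instance terminated (OOMKilled vs. something else),
# using one of the pod names from the output above.
kubectl -n longhorn-system describe pod engine-image-ei-87057037-k52s9 | grep -A5 'Last State'
```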
a
Actually the `engine-image-ei` pod only runs a few commands; it should not trigger an OOM.
Have you checked the memory usage of other applications, since you mentioned the old JDKs? It is weird that only the `engine-image-ei-xxx` pods got killed. Could you observe the memory usage of the engine-image pods?
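A minimal way to do that, assuming metrics-server is installed on the cluster:
```bash
# Live memory usage of the Longhorn engine-image pods (requires metrics-server).
kubectl top pod -n longhorn-system | grep engine-image

# The same view broken down per container.
kubectl top pod -n longhorn-system --containers | grep engine-image
```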
c
Maybe the oomd kubectl plugin is buggy, but it reports only the engine-image-ei-something pods. Today:
```
% kubectl oomd -A
NAMESPACE         POD                              CONTAINER                  REQUEST   LIMIT   TERMINATION TIME
longhorn-system   engine-image-ei-87057037-k52s9   engine-image-ei-87057037   0         0       2023-07-24 13:18:10 +0200 CEST
longhorn-system   engine-image-ei-87057037-pfxx8   engine-image-ei-87057037   0         0       2023-07-26 07:35:24 +0200 CEST
longhorn-system   engine-image-ei-87057037-trxfw   engine-image-ei-87057037   0         0       2023-06-28 12:00:19 +0200 CEST
longhorn-system   engine-image-ei-ef01bf86-5wsb8   engine-image-ei-ef01bf86   0         0       2023-07-25 11:10:48 +0200 CEST
longhorn-system   engine-image-ei-ef01bf86-gjmzd   engine-image-ei-ef01bf86   0         0       2023-07-24 13:18:08 +0200 CEST
longhorn-system   engine-image-ei-ef01bf86-spq6b   engine-image-ei-ef01bf86   0         0       2023-06-28 12:00:19 +0200 CEST
```
oomd seems to be an internal Kubernetes thing. There are "real" OOMs listed in kernel.log as well (not related to Longhorn), but kubectl oomd doesn't list those. AFAICT oomd kills pods if resources are getting tight, and pods that don't request any resources are killed first.
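That would match Kubernetes QoS behaviour: pods with no requests or limits run as BestEffort and are the first candidates under memory pressure. A quick way to confirm the QoS class of the affected pods, using a pod name from the output above:
```bash
# QoS class of one of the affected pods; BestEffort pods are reclaimed first
# under node memory pressure.
kubectl -n longhorn-system get pod engine-image-ei-87057037-k52s9 \
  -o jsonpath='{.status.qosClass}{"\n"}'

# Overview of QoS classes for all pods in the namespace.
kubectl -n longhorn-system get pods \
  -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
```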
b
Have you tried the Priority Class setting to see if it helps?
c
No, not yet. Thank you very much for the pointer. I will try.
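For completeness, a sketch of what that could look like: Longhorn exposes a priority-class setting (adjustable in the Longhorn UI), and the referenced PriorityClass object has to exist in the cluster first. The class name and priority value below are examples only:
```bash
# Create a high-priority class for Longhorn system pods (name and value are examples).
kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: longhorn-critical
value: 1000000
globalDefault: false
description: "Keep Longhorn system components from being reclaimed before workloads"
EOF

# Then point Longhorn's priority-class setting at it, either through the Longhorn UI
# settings page or, if managing settings via kubectl is acceptable in your setup,
# by editing the setting resource directly:
kubectl -n longhorn-system edit settings.longhorn.io priority-class
```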