# k3s
l
After further testing on a single, more powerful node (the other difference being that it doesn't have swap), it seems to mostly work fine, apart from occasional cases where it does saturate the disk, though it recovers. Beyond just running a more powerful server, is there a way to solve this?
c
what’s the source of all your extra disk IO?
I would probably try to figure that out first. is it just all your image pulls?
l
Currently (with the larger node) it's seemingly everything: many processes are using >1MBps, with a k3s server process hitting ~10MBps. Before, if I remember correctly, it was just one k3s server process using 500MBps, but I'll recreate that once I get this fully working. This is not during image pulls - during image pulls there's an expected spike from containerd, but it's not significant
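For context, a rough way to get per-process numbers like these (assuming iotop and the sysstat package are installed, which may not match your setup):

```sh
# per-process disk read/write rates (kB_rd/s, kB_wr/s), sampled every 5 seconds
pidstat -d 5
# or interactively: only processes actually doing IO, with accumulated totals
sudo iotop -oPa
```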
c
are you still using k3d to run multiple k3s nodes on a single host?
l
No, k3d was just for local development
now I'm trying to get that to work in a real environment
c
that doesn’t seem right then. There should be only one k3s server process. Are those perhaps thread IDs? What do you see in `ps -auxf`?
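Or, as a sketch (assuming the sysstat package is available), to see threads and per-thread IO directly:

```sh
# list every thread of the k3s server process (LWP column = thread ID)
ps -eLf | grep '[k]3s server'
# per-thread disk IO for that process, sampled every 5 seconds
pidstat -dt -p "$(pgrep -of 'k3s server')" 5
```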
l
Yeah, there's only one k3s server
thread IDs then
c
are you using etcd or kine/sqlite?
l
fairly certain I'm using sqlite - I didn't explicitly configure anything otherwise
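A quick way to double-check which datastore is actually in use (these are the default k3s paths, assuming a standard install):

```sh
# kine over sqlite: a state.db file exists
ls -lh /var/lib/rancher/k3s/server/db/state.db*
# embedded etcd: an etcd directory exists instead
ls -ld /var/lib/rancher/k3s/server/db/etcd
```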
c
that’s an unusually large amount of IO. Do you have something that is hammering on the apiserver with create/modify/delete requests?
l
My suspicion would be fluxcd, but while testing on k3d, when I was able to access grafana, the cluster always stayed under 100 apiserver requests per second
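That number came from a grafana dashboard, but roughly the same thing can be queried directly (assuming a kube-prometheus-style setup scraping apiserver metrics; the prometheus URL here is a placeholder):

```sh
# total apiserver requests per second over the last 5 minutes
curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(apiserver_request_total[5m]))'
```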
c
with k3d you might have been using a temporary datastore that was just on tmpfs, instead of real disk
now you’re actually putting load on it and it can’t keep up
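It's easy to verify what the datastore directory is actually backed by (default k3s data dir assumed):

```sh
# filesystem type backing the k3s datastore (tmpfs vs ext4/xfs/...)
findmnt -T /var/lib/rancher/k3s/server/db
df -hT /var/lib/rancher/k3s/server/db
```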
l
interesting, so is it normal for k3s to generate such high disk throughput when larger volumes of requests hit the apiserver?
(presumably 500MBps is far beyond normal, but I mean the 10MBps I'm currently seeing)
c
it’s really just driven by what you do with it. The more you change on the apiserver, the more will get written to disk.
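If you want to see how much of that traffic is actually mutating (and therefore turning into datastore writes), something like this should work, assuming the same apiserver metrics are scraped:

```sh
# write-type apiserver requests per second; these are the ones that hit the datastore
curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(apiserver_request_total{verb=~"CREATE|UPDATE|PATCH|DELETE"}[5m]))'
```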
l
Ah, I'll try to get what I barely have working now to the point where I can reach a grafana dashboard and see what's going on...
r
As a note, you mentioned your more successful test not having swap. Kubernetes in general recommends against having swap. You might try a machine spec'd like your original but with no swap partition and see what happens.
(or just comment out your swap partition in /etc/fstab on your original machine and reboot)
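Something like this (standard Linux, assuming swap is set up via /etc/fstab; adjust if yours comes from systemd or a swapfile elsewhere):

```sh
# turn swap off immediately
sudo swapoff -a
# comment out swap entries so it stays off after a reboot (keeps a .bak copy)
sudo sed -i.bak '/[[:space:]]swap[[:space:]]/s/^/#/' /etc/fstab
```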
l
Oh, that's good to know, thank you
running the original setup without swap somehow ends up even worse, with 1.5GiBps being read...
but I'm starting to suspect that it's just running out of ram
seems like it, created swap and it suddenly dropped to comparatively reasonable numbers
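For anyone following along, a couple of generic checks for that (nothing k3s-specific assumed):

```sh
# current memory and swap usage
free -h
# any recent kernel OOM-killer activity
sudo dmesg -T | grep -iE 'out of memory|oom'
```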
r
It's odd that it worked with k3d. If it worked in k3d but is going haywire in straight k3s, the only thing I can think of is that maybe k3d has the oom killer set a bit faster and you have runaway pods getting restarted, or it's purging etcd after a certain size (which with default in-memory sqlite would directly hit RAM).
Having swap on and ending up with fewer disk reads probably means that your cluster is just operating slower.
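If you want to rule out restart churn, a couple of plain-kubectl checks (no assumptions about your monitoring stack):

```sh
# pods sorted by restart count
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'
# containers whose last termination reason was OOMKilled
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled
```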
l
My only theory is that it's the difference in resources - I'll try to run it in a few VMs locally. I'll check the k3d logs, but I don't remember seeing anything like this in them
c
I still think it’s probably related to the k3d etcd datastore being on tmpfs and not real disk
swap may also be related
l
Even if it's on tmpfs I should still be able to see if it's reading large amounts?
Last night I managed to get it to grafana and from what I saw, the request volume was identical to what I see locally
c
no, tmpfs is memory backed. not real disk
I’m talking about the IO from the k3s processes. Grafana’s disk IO will come from different processes.
l
Er, my last message was a bit unrelated, just wanted to point out that I didn't see a massive difference in API request rates. Surprised that there's no way to monitor tmpfs, oh well
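(For what it's worth, tmpfs usage can still be inspected, it just never shows up in block-device IO stats since there's no disk underneath; the mount path here is only an example:)

```sh
# space used by a tmpfs mount
df -h /run
# per-directory usage inside it
sudo du -sh /run/*
```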
r
I thought k3s default etcd was in-memory sqlite, I guess that's wrong? But if k3d etcd is all tmpfs-backed and k3s is disk backed, enormous disk load increase for k3s makes perfect sense.
l
It should be sqlite, it's writing to an sqlite db
Increased memory on the server (4->8G) and everything works fine, now I'm interested to know why it failed the way it did...
r
SQLite can point at a file, but you can also run it in-memory. I thought k3s ran it in-memory and just wrote it out to the disk-backed file when stopped or every few minutes, but I guess it works with the file directly.
c
If you have more than one server, you're using etcd. Sqlite is single server only. They both write to disk, but if that disk is tmpfs and not an actual physical disk then the IO profile will be different regardless of which one you're using.
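For completeness, a rough sketch of how that choice shows up in practice (default paths/flags from a standard k3s install; adjust for your setup):

```sh
# default single server: kine over sqlite at /var/lib/rancher/k3s/server/db/state.db
k3s server
# embedded etcd instead (and what you get automatically with multiple servers)
k3s server --cluster-init
```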