# k3s
c
I'm looking for some help with a performance issue. I have an application that is running very slowly when I run it as a pod in k3s, but if I run the same workload using docker directly on the host machine, it runs way faster... like the workload takes <1m using docker vs 2.5hrs when run inside a pod. Any advice on where to even start looking?
the application in question is `prowler`, which is launching a bunch of async tasks via some python library
m
Do you have resource limits set? get a dump of the deployment and/or pod manifest
c
I have no resource limits on the pod
m
how about resource requests?
c
None, the `Resources` on the container is just an empty object (pulling manifest momentarily)
m
is it a single host running k3s?
c
yes
m
try to put resource requests so that it will get some guaranteed cpu and memory
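As a sketch, a requests/limits stanza on the container would look something like this (the cpu/memory values are illustrative placeholders, not from the actual manifest):

```yaml
# hypothetical resources stanza for the prowler container;
# tune the values to what the workload actually needs
resources:
  requests:
    cpu: "2"
    memory: 1Gi
  limits:
    cpu: "3"
    memory: 1Gi
```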
๐Ÿ‘ 1
c
will try that now and report
Even with the configured resources, it's running quite slowly. Here is the manifest
giving it 3 CPU and 1Gi of memory, it's still not running as fast as when just run with docker
`top pod` is reporting that it's just not using any CPU anyway
m
maybe take a look at docker stats and see how much it is using and match that
c
it goes to 100% for a bit at the start, and then jumps around 25-75% for a while
ends up using ~480M of memory
It's doing a lot of network calls, could that be slowing it down?
ya, it ends up pulling down like 7MB worth of JSON data from the aws API, so I'm assuming it's just rapid-firing those network calls
would something in the k3s networking be slowing that down?
m
CoreDNS maybe, try to see if you can resolve URLs quickly inside a container running in k3s. Do you have a lot of DNS lookups?
c
it's all calls to the AWS API, so I would think it would only be resolving a handful of times and then have it cached, no? (I'm not too savvy on core-dns itself)
I don't see any slowdown in general DNS resolution when using dig and curl inside an `alpine` container running in k3s
running `apk update` did take a while to complete though, longer than I would expect
yeah, `apk update` in docker returns super fast
m
are you multi-homed with your host?
wired? wifi?
c
I don't think so?
the host is an EC2 instance running debian
m
If you're able to experiment, try running k3s with the docker shim, i.e. re-install k3s with
`curl -sfL https://get.k3s.io | sh -s - --docker`
you may get some traction, or change flannel to calico for the CNI
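For the flannel-to-calico swap, a rough sketch (the flannel-disabling flags are the documented k3s ones; the Calico manifest URL and version are assumptions to check against the Calico docs):

```shell
# reinstall k3s with its bundled flannel CNI disabled
curl -sfL https://get.k3s.io | sh -s - --flannel-backend=none --disable-network-policy
# then install Calico; pin whatever version is current
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml
```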
c
I can add that `--docker` flag by just updating the systemd service file, right?
m
yeah you can try it
c
hm, neither of those options seems to be fixing the issue at hand... but this has officially become a Monday problem
๐Ÿ‘ 1
m
try using the latest AL2023 C5 instances and see if it makes any difference
c
@millions-dusk-51992 any recommendation between kernel 6.1 and 6.12?
m
6.12 I think is LTS, go for that, unless you're super dependent on 6.1 for any specific reason
✅ 1
c
Any opinions on AMD vs Intel for this?
m
Personally AMD but shouldn't be any different
✅ 1
If you want to test whether it's a k3s problem, maybe deploy an EKS cluster that can scale to zero and try it there, rather than building an environment from scratch.
c
it's mostly automated anyway, just needed to tweak the setup script to use dnf instead of apt
๐Ÿ‘ 1
m
were you able to fix the performance problem?
c
OK, finally confirmed it is an issue with the k3s setup. I got a cluster running with `kops` using Cilium for the CNI and the workload in question runs in the expected timeframe
running on the latest c5a.large instance did not fix it, nor did using `--docker` for the runtime
I also tried switching out the CNI in k3s to use calico, which also did not work
m
nice! don't know if it was related to cgroup v1 vs v2 but I seem to recall encountering that
c
(as an aside.. I'd never actually tried to deploy EKS until trying to debug this... what a pain in the ass it is...)
> nice! don't know if it was related to cgroup v1 vs v2 but I seem to recall encountering that
Sorry, not sure what you mean. Is this something that I can solve with k3s, or do I need to stick with an alternative for this workload?
m
depends on the OS you're using; I remember encountering an issue with cgroup v1 vs v2 in the past. Which OS distro were you using?
c
I tried on both debian and AL2023
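One quick thing to compare on the two hosts: the filesystem type mounted at /sys/fs/cgroup tells you which cgroup version the kernel is using (a generic Linux check, nothing k3s-specific):

```shell
# prints "cgroup2fs" on a unified cgroup v2 host,
# "tmpfs" when the legacy v1 hierarchy is mounted
stat -fc %T /sys/fs/cgroup
```

I believe recent Debian releases and AL2023 both default to cgroup v2, so if both hosts report cgroup2fs that would point away from this theory.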
m
ok same problem?
c
yes
m
what version of k3s did you use?
c
it was the latest when I installed
```shell
curl -sfL https://get.k3s.io > /tmp/k3s
chmod +x /tmp/k3s
/tmp/k3s \
  --write-kubeconfig-mode 640 \
  --write-kubeconfig-group adm \
  --cluster-cidr 10.44.0.0/16 \
  --node-name ${pInstanceHostname} \
  --with-node-id
```
the machine with the initial problem was deployed last month, then I deployed again when I ran the test on a c5a.large with AL2023
m
do you still have that machine or is it gone?
c
I have it, just shut it down for a bit..
one sec
m
If it works with kops, just stick with it since Cilium CNI is better
c
I know, I just wanted to have everything easily on one node
I don't actually need a cluster, I just prefer k8s as an orchestration tool
m
yep, you can also use kind or minikube as a single node, which uses just docker. Did you end up running on the same instance type, c5a.large with AL2023?
c
yes
m
Oh, you're running the k3s binary under /tmp? don't know if that has an effect, you didn't run the install?
c
```shell
curl -sfL https://get.k3s.io > /tmp/k3s
```
the installer script is just being written to `/tmp/k3s`
m
ok got it