Full logs of K3s master node 1
# k3s
t
prod-agent-app-blue-0-f0380b7a       Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-app-blue-1-a234b216       Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-app-blue-10-1ef6ca3c      Ready    <none>                 46d    v1.30.3+k3s1
prod-agent-app-blue-11-2dccff04      Ready    <none>                 46d    v1.30.3+k3s1
prod-agent-app-blue-2-9c964185       Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-app-blue-3-aef87d06       Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-app-blue-4-79b5072b       Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-app-blue-5-bb69767d       Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-app-blue-6-cd93959b       Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-app-blue-7-c3860444       Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-app-blue-8-49a23f58       Ready    <none>                 179d   v1.30.3+k3s1
prod-agent-app-blue-9-63cdf3ff       Ready    <none>                 46d    v1.30.3+k3s1
prod-agent-data-green-0-9b98de05     Ready    <none>                 33d    v1.30.14+k3s2
prod-agent-data-green-1-efd5b590     Ready    <none>                 33d    v1.30.14+k3s2
prod-agent-data-green-3-47319b7a     Ready    <none>                 33d    v1.30.14+k3s2
prod-agent-data-green-4-96deb91b     Ready    <none>                 33d    v1.30.14+k3s2
prod-agent-data-green-5-f52e9ef0     Ready    <none>                 33d    v1.30.14+k3s2
prod-agent-data-green-6-420706e2     Ready    <none>                 33d    v1.30.14+k3s2
prod-agent-data-green-7-93605edd     Ready    <none>                 33d    v1.30.14+k3s2
prod-agent-ingress-blue-0-ac9168ff   Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-ingress-blue-1-4323cca2   Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-ingress-blue-2-39068466   Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-ops-blue-0-4b80a6a8       Ready    <none>                 375d   v1.30.3+k3s1
prod-agent-ops-blue-1-89a427cc       Ready    <none>                 375d   v1.30.3+k3s1
prod-server-blue-k3s-00              Ready    control-plane,master   375d   v1.30.14+k3s2
prod-server-blue-k3s-01              Ready    control-plane,master   375d   v1.30.14+k3s2
cc @creamy-pencil-82913, kindly help identify the reason behind this (screenshots of the postgres DB & k3s server metrics for the past 12 hours).
I see a lot of
level=info msg="Slow SQL (started:
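Assuming k3s is running as the default systemd k3s unit, these can be counted and sampled from the journal with something like:

# count and sample the kine "Slow SQL" messages from the last 12 hours (unit name is an assumption)
journalctl -u k3s --since "12 hours ago" | grep -c "Slow SQL"
journalctl -u k3s --since "12 hours ago" | grep "Slow SQL" | tail -n 5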
There seems to be a problem with the postgres disk. Here are the fio results:
root@k3s-db:~# sudo fio --name=write_throughput --directory=$TEST_DIR --numjobs=8 \
--size=10G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio \
--direct=1 --verify=0 --bs=1M --iodepth=64 --rw=write \
--group_reporting=1
write_throughput: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
...
fio-3.28
Starting 8 processes
write_throughput: Laying out IO file (1 file / 10240MiB)
write_throughput: Laying out IO file (1 file / 10240MiB)
write_throughput: Laying out IO file (1 file / 10240MiB)
write_throughput: Laying out IO file (1 file / 10240MiB)
write_throughput: Laying out IO file (1 file / 10240MiB)
write_throughput: Laying out IO file (1 file / 10240MiB)
write_throughput: Laying out IO file (1 file / 10240MiB)
write_throughput: Laying out IO file (1 file / 10240MiB)
Jobs: 8 (f=8): [W(8)][36.8%][w=504MiB/s][w=504 IOPS][eta 01m:48s]
write_throughput: (groupid=0, jobs=8): err= 0: pid=1658672: Tue Sep  9 05:41:46 2025
  write: IOPS=487, BW=496MiB/s (520MB/s)(29.3GiB/60483msec); 0 zone resets
    slat (usec): min=49, max=831355, avg=16248.41, stdev=35157.16
    clat (msec): min=84, max=2213, avg=1024.98, stdev=246.99
     lat (msec): min=112, max=2221, avg=1041.27, stdev=249.06
    clat percentiles (msec):
     |  1.00th=[  359],  5.00th=[  634], 10.00th=[  760], 20.00th=[  860],
     | 30.00th=[  911], 40.00th=[  961], 50.00th=[ 1011], 60.00th=[ 1070],
     | 70.00th=[ 1133], 80.00th=[ 1200], 90.00th=[ 1318], 95.00th=[ 1435],
     | 99.00th=[ 1703], 99.50th=[ 1854], 99.90th=[ 2056], 99.95th=[ 2123],
     | 99.99th=[ 2165]
   bw (  KiB/s): min=86024, max=1287091, per=99.30%, avg=504406.21, stdev=22453.54, samples=958
   iops        : min=   84, max= 1256, avg=491.44, stdev=21.93, samples=958
  lat (msec)   : 100=0.01%, 250=0.44%, 500=2.37%, 750=6.72%, 1000=38.75%
  lat (msec)   : 2000=53.22%, >=2000=0.20%
  cpu          : usr=0.52%, sys=0.48%, ctx=10702, majf=0, minf=466
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,29492,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=496MiB/s (520MB/s), 496MiB/s-496MiB/s (520MB/s-520MB/s), io=29.3GiB (31.5GB), run=60483-60483msec

Disk stats (read/write):
    dm-0: ios=22/32864, merge=0/0, ticks=480/11131308, in_queue=11131788, util=99.91%, aggrios=22/46289, aggrmerge=0/2689, aggrticks=411/14916470, aggrin_queue=14916880, aggrutil=99.67%
  sda: ios=22/46289, merge=0/2689, ticks=411/14916470, in_queue=14916880, util=99.67%
Average completion latency: ~1025 ms
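The 1M sequential job above mostly measures bandwidth, though; kine/etcd-style datastores are more sensitive to sync latency. A small fdatasync test along the lines of the one in the etcd hardware docs (same $TEST_DIR) would show the sync percentiles; the usual rule of thumb there is a 99th percentile under roughly 10 ms:

# sequential small writes with an fdatasync after each one; fio reports fdatasync latency percentiles for this job
fio --name=sync_write_lat --directory=$TEST_DIR --rw=write --ioengine=sync \
  --fdatasync=1 --size=22m --bs=2300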
c
Yeah that's way too slow
1.30.3 is a little old; there have been a few improvements to kine since then, but nothing that's going to fix a slow disk for you.
👍 1
Besides looking at disk throughput, you might check the logs and make sure compact is running every 5 minutes, and succeeding. If the disk is too slow that might be timing out, which would make things worse.
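Something like this should show whether it's running and succeeding, assuming k3s logs to the systemd journal under the default k3s unit:

# compaction messages from the last few hours; look for failures/timeouts rather than a single sample
journalctl -u k3s --since "6 hours ago" | grep -i compact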
Oh wait you have some on 1.30.3 and some on 1.30.14? Why?
t
Yes, compact runs successfully every 5 minutes, but after some time it fails with timeout errors (I don't see any pattern at the moment; it seems to be random). Regarding the version mismatch: I had an issue with iSCSI on some of the data-node group, and while fixing it the script picked up the latest 1.30 patch when rolling the new data-node group. That was done a month ago and I did not find any issues since then. When I noticed the first k3s server restart a few days ago, I upgraded the master nodes to the same patch version. But would that be a problem specifically in K3s? As long as all the nodes use the same minor version, we should be okay, right?
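A related thing to watch here: if compaction keeps timing out, old revisions pile up in the kine table and every query gets slower. Assuming kine's default postgres database name "kubernetes" and table name "kine" (adjust if the datastore endpoint overrides them), the size can be checked with:

# row count and on-disk size of the kine table; host and user are placeholders
psql -h <postgres-host> -U <user> -d kubernetes -c \
  "SELECT count(*) AS row_count, pg_size_pretty(pg_total_relation_size('kine')) AS total_size FROM kine;"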
c
yeah but still, good idea to keep things up to date. the two patch releases are about a year apart.
does seem most likely to be an issue with the backing disk. is it shared with something else? is it close to full?
t
Okay, it is using a vSAN datastore; I'm running this setup on VMware. I have allocated 100G to the postgres instance and it currently has sufficient disk space left (80G free). I will need to check with my Infra team on the vSAN side of things for any anomalies.
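In the meantime, iostat from inside the postgres VM should at least show whether the latency is visible at the guest block device (iostat is from the sysstat package; the await/w_await columns are per-request latency in ms, and the exact column names vary slightly by version):

# extended device stats, sampled every 5 seconds, 3 samples
iostat -x 5 3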
I migrated the postgres VM onto a much better SSD-backed appliance. I have not seen any slow SQL messages logged on the k3s servers since then. Thanks for your help @creamy-pencil-82913
👍 1