# rke2
p
I'm having a super inconsistent issue where sometimes, seemingly at random, SOME DNS requests in my pods will fail. It is super inconsistent: sometimes a pod will fail to resolve a single domain for several seconds, some will fail just once, and in the worst case only restarting the pod makes resolution work again. I am using standard rke2 with Calico, on Debian 12 with systemd-resolved and netplan. I have seen several comments about setting --resolv-conf as a kubelet argument, or disabling checksum offload for Calico and VXLAN, however those fixes are for permanent issues whereas mine is very intermittent. Does anyone have an idea? Of course, if I do a manual DNS request myself, it works every time. Also it seems this only affects DNS (or UDP as a whole?), as I don't seem to have any issues other than randomly failing to resolve a hostname... Any tips?
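For context, this is roughly what those two suggested fixes look like, just a sketch (the paths and the VXLAN interface name are assumptions from my setup, check your own node):
```
# systemd-resolved: /etc/resolv.conf is a stub pointing at 127.0.0.53, which
# pods can't reach; the real upstream list is here:
resolvectl status
cat /run/systemd/resolve/resolv.conf

# The kubelet-side fix people mention: point kubelet at that file. On rke2
# that would go in /etc/rancher/rke2/config.yaml, e.g.
#   kubelet-arg:
#     - "resolv-conf=/run/systemd/resolve/resolv.conf"

# The Calico/VXLAN fix people mention: disable TX checksum offload on the
# VXLAN interface (name varies, check `ip -d link` first):
ethtool -k vxlan.calico | grep tx-checksum
sudo ethtool -K vxlan.calico tx-checksum-ip-generic off
```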
Sometimes in my logs, for certain domains, the CoreDNS pods complain about:
```
[ERROR] plugin/errors: 2 my.domain. A: read udp 10.42.241.20:35645->8.8.8.8:53: i/o timeout
```
This doesn't happen for every failed domain resolution, but it sure doesn't look normal.
It sounds very similar to this: https://github.com/kubernetes/dns/issues/480 😞
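For anyone else debugging this: one quick check on a node while a failure burst happens is whether conntrack insertion failures are climbing, which is the usual signature of the UDP/DNAT race behind these intermittent timeouts (a sketch, needs conntrack-tools installed):
```
# Run on the node during a failure burst; if insert_failed keeps increasing,
# it's the conntrack race on concurrent UDP DNS queries (the thing
# NodeLocal DNS is meant to work around).
sudo conntrack -S

# Same per-CPU counters (in hex) without conntrack-tools:
cat /proc/net/stat/nf_conntrack
```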
Also, running dnsperf yields about 0.05% lost requests. It seems that instead of timeouts, the requests are just... refused? They come in groups:
```
[Status] Command line: dnsperf -s 10.43.0.10 -d /opt/records.txt -c 1 -T 1 -l 30 -t 5 -Q 100000
[Status] Sending queries (to 10.43.0.10:53)
[Status] Started at: Fri May 23 13:38:50 2025
[Status] Stopping after 30.000000 seconds
[Timeout] Query timed out: msg id 2184
[Timeout] Query timed out: msg id 2185
[Timeout] Query timed out: msg id 2186
[Timeout] Query timed out: msg id 5459
[Timeout] Query timed out: msg id 5460
[Timeout] Query timed out: msg id 5461
[Timeout] Query timed out: msg id 5462
[Timeout] Query timed out: msg id 11606
[Timeout] Query timed out: msg id 11607
[Timeout] Query timed out: msg id 11608
[Timeout] Query timed out: msg id 11609
[Timeout] Query timed out: msg id 11610
[Timeout] Query timed out: msg id 11611
[Timeout] Query timed out: msg id 11612
[Timeout] Query timed out: msg id 25907
[Timeout] Query timed out: msg id 26579
[Timeout] Query timed out: msg id 26580
[Timeout] Query timed out: msg id 26581
[Timeout] Query timed out: msg id 26905
[Timeout] Query timed out: msg id 27659
[Timeout] Query timed out: msg id 27660
[Timeout] Query timed out: msg id 38327
[Timeout] Query timed out: msg id 38969
[Timeout] Query timed out: msg id 38970
[Timeout] Query timed out: msg id 38971
[Timeout] Query timed out: msg id 38972
[Timeout] Query timed out: msg id 39397
[Timeout] Query timed out: msg id 39398
[Timeout] Query timed out: msg id 49515
[Timeout] Query timed out: msg id 49516
[Timeout] Query timed out: msg id 56731
[Timeout] Query timed out: msg id 56732
[Timeout] Query timed out: msg id 56733
[Timeout] Query timed out: msg id 56734
[Timeout] Query timed out: msg id 56735
[Timeout] Query timed out: msg id 64365
[Timeout] Query timed out: msg id 64366
[Timeout] Query timed out: msg id 8318
[Timeout] Query timed out: msg id 8319
[Timeout] Query timed out: msg id 8320
[Timeout] Query timed out: msg id 8321
[Timeout] Query timed out: msg id 8322
[Timeout] Query timed out: msg id 14376
[Timeout] Query timed out: msg id 14377
[Status] Testing complete (time limit)
Statistics:
  Queries sent:         87223
  Queries completed:    87179 (99.95%)
  Queries lost:         44 (0.05%)
```
I can also reproduce it on my test cluster 😕 Running NodeLocal DNS fixes the problem... but only on subsequent dnsperf runs: the first run will still contain some failures, and afterwards 100% of them pass... It really sounds like I'm having UDP failures across my rke2 clusters 😞
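In case it's useful, this is how I checked that the node-local cache was actually in the path on each node (the pod and interface names are the upstream node-local-dns defaults and might differ in your setup):
```
# The cache should run as a daemonset, one pod per node
kubectl -n kube-system get pods -o wide | grep -i nodelocal

# Upstream node-local-dns adds a dummy interface on each node that
# intercepts DNS traffic to the cluster DNS IP
ip addr show nodelocaldns
```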
s
Did you enable NodeLocal DNS, or just CoreDNS? Can you try changing the forward DNS from 8.8.8.8 to 1.1.1.1 (Cloudflare) and see if anything changes? This is to make sure there is no network blocking or Google rate limiting of your public IP. Enabling NodeLocalDNS will reduce the external hits since it caches DNS. Also, you can try enabling caching on CoreDNS for 60 seconds (for example) to reduce upstream hits. https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
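Something like this, roughly; the ConfigMap and deployment names are a guess at the rke2 ones, and editing the ConfigMap directly gets reverted on chart upgrades, so treat it as a quick test:
```
# Find the CoreDNS ConfigMap (on rke2 it's usually prefixed rke2-coredns):
kubectl -n kube-system get configmap | grep -i coredns
kubectl -n kube-system edit configmap rke2-coredns-rke2-coredns

# In the Corefile, point the forwarder at Cloudflare and bump the cache, e.g.
#   forward . 1.1.1.1 1.0.0.1
#   cache 60
# then restart CoreDNS so it picks the change up:
kubectl -n kube-system rollout restart deployment rke2-coredns-rke2-coredns
```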
p
@straight-actor-37028 Holy hell, sorry, I never saw that you answered me! I was just about to say that enabling NodeLocal DNS on my production cluster this weekend reduced my problems massively (but not entirely, it seems?). I can have 5 dnsperf pods doing a ton of requests without dropping a single one, and the CoreDNS pods have no errors (so far). It seems to mask the underlying problem more than properly fix it, but oh well, good enough. Thanks for your answer!
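For reference, this is roughly how I enabled it on rke2, going from the rke2-coredns chart's nodelocal values; double-check the key names against your chart version before relying on it:
```
# Drop a HelmChartConfig into the manifests dir on a server node; rke2 will
# re-deploy the bundled coredns chart with the node-local cache enabled.
cat <<'EOF' | sudo tee /var/lib/rancher/rke2/server/manifests/rke2-coredns-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    nodelocal:
      enabled: true
EOF
```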