# rke2
c
have you isolated it to anything in particular? lookups for a particular resource or tld, lookups between pods on different nodes, or so on?
b
Yes, I'm specifically trying to do lookups for node-X.zerotesting-service.zerotesting.svc.cluster.local
where X is from 0 to 999 (or however many are actively running in that service and alive and healthy)
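For reference, a single test lookup looks like this from any pod with basic DNS tooling (node-0 is just one example pod name from the set):
```
# From any pod with busybox nslookup or dig installed
nslookup node-0.zerotesting-service.zerotesting.svc.cluster.local

# or, if dig is available:
dig +short node-0.zerotesting-service.zerotesting.svc.cluster.local
```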
c
The most frequent cause of dropped traffic is something affecting CNI traffic between nodes. For example, DNS requests from node A fail when they hit a coredns pod on node C, but work fine when they hit a coredns pod on a different node.
can you scale coredns down to a single replica, and then test lookups from pods on different nodes?
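Something along these lines should do it — the deployment names below assume the stock rke2-coredns chart, and if the coredns autoscaler is running it will undo a manual scale, so stop that first:
```
# Names assume the stock rke2-coredns chart in kube-system; adjust if yours differ.
kubectl -n kube-system scale deployment rke2-coredns-rke2-coredns-autoscaler --replicas=0
kubectl -n kube-system scale deployment rke2-coredns-rke2-coredns --replicas=1

# then run the same lookup from pods that live on different nodes
kubectl exec <pod-on-node-A> -- nslookup node-0.zerotesting-service.zerotesting.svc.cluster.local
kubectl exec <pod-on-node-B> -- nslookup node-0.zerotesting-service.zerotesting.svc.cluster.local
```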
b
👍
c
also confirm that you’re not flooding coredns with more traffic than it can handle. By default coredns is limited to 100m CPU / 128Mi memory. If you are hitting it with a ton of traffic, you may need to raise those limits.
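You can sanity-check usage, and on RKE2 the clean way to raise the limits is a HelmChartConfig rather than editing the deployment (the helm controller will revert direct edits). Names, labels, and values keys below assume the stock chart, so treat it as a sketch:
```
# Check current usage vs. requests/limits (requires metrics-server; the label
# is what the stock rke2-coredns chart applies).
kubectl -n kube-system top pod -l k8s-app=kube-dns

# Raise the limits via a HelmChartConfig on a server node.
cat <<'EOF' > /var/lib/rancher/rke2/server/manifests/rke2-coredns-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
EOF
```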
b
Yep, it's a relatively modest load, probably about 1-10 qps
Oh
Also, it's not just dropped traffic @creamy-pencil-82913
weirdly it's actually NXDOMAIN-ing for lookups that it shouldn't
and this is basically bone stock RKE2 CoreDNS
The only config thing I changed was adding the "log" directive to the configmap (and the behaviour occurs even without that)
I've confirmed, using a static pod set to sleep infinity with /etc/resolv.conf edited to point at a particular CoreDNS pod directly, that these failures occur with all of the CoreDNS pods
regardless of host
It's always weirdly one in 5.5 to 6.5 queries. Always.
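Roughly how I'm measuring it, for reference — the nameserver IP below is a placeholder for whichever coredns pod the test pod gets pinned to:
```
# Throwaway pod pinned to one coredns pod via dnsConfig (IP is a placeholder),
# rather than editing /etc/resolv.conf by hand.
kubectl run dns-debug --image=busybox:1.36 --restart=Never \
  --overrides='{"spec":{"dnsPolicy":"None","dnsConfig":{"nameservers":["<coredns-pod-ip>"]}}}' \
  -- sleep infinity

# 100 identical lookups against that one pod; busybox nslookup exits non-zero
# when the name doesn't resolve, so just count failures.
kubectl exec dns-debug -- sh -c '
  ok=0; bad=0
  for i in $(seq 1 100); do
    if nslookup node-0.zerotesting-service.zerotesting.svc.cluster.local >/dev/null 2>&1; then
      ok=$((ok+1))
    else
      bad=$((bad+1))
    fi
  done
  echo "ok=$ok bad=$bad"'
```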
c
oh, hm. what kind of resource should that point at? a service, a pod, something else?
b
(On average).
It's a StatefulSet with 2600 pods in it, and a HeadlessService that points at the pods.
so that we can resolve pod-name.servicename
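The shape of the setup is basically this (labels and image below are placeholders, the names are the real ones); the DNS-relevant parts are the headless Service (clusterIP: None) and the StatefulSet's serviceName pointing at it, which is what creates the per-pod node-X.zerotesting-service records:
```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: zerotesting-service
  namespace: zerotesting
spec:
  clusterIP: None
  selector:
    app: zerotesting
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: node
  namespace: zerotesting
spec:
  serviceName: zerotesting-service
  replicas: 2600
  selector:
    matchLabels:
      app: zerotesting
  template:
    metadata:
      labels:
        app: zerotesting
    spec:
      containers:
      - name: app
        image: busybox:1.36
        command: ["sleep", "infinity"]
EOF
```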
c
is the coredns pod getting restarted? how is its resource utilization?
b
No, it's stable. Resource usage is low, and it's tweaked to have higher limits anyways
c
how old is the pod whose name you are trying to resolve? Is it just not resolving them immediately after creation?
b
Very old at this point, 30m+
It's currently a fairly stable statefulset of 2600 pods which hasn't been scaled up or down in at least 10 minutes, and most of the pods are 30m+ old
c
hm. You might edit the coredns deployment to turn up the coredns log level and see if it has any hints why it’s claiming nxdomain
and/or check at https://github.com/coredns/coredns/ and see if there are any similar reports
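One way to turn the logging up, assuming the stock chart's configmap and deployment names (manual configmap edits may eventually get reverted by the helm controller, but it's fine for a debug session):
```
# Add the debug plugin alongside log/errors in the Corefile, then bounce
# coredns so it picks the change up.
kubectl -n kube-system edit configmap rke2-coredns-rke2-coredns
#   .:53 {
#       debug     # verbose plugin logging
#       errors
#       log       # per-query logging (already added above)
#       ...
#   }
kubectl -n kube-system rollout restart deployment rke2-coredns-rke2-coredns
```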
b
(Also, JIC, I upgraded to the latest CoreDNS and am still seeing it)
c
hmm. do you see anything different if you bypass the coredns service and hit a single coredns pod by IP?
b
I did find this old issue a few days ago - https://github.com/coredns/coredns/issues/1365
And no, that's one of the things I've been doing to test it. Both service + coredns pod directly give the same results.
the consistency is almost remarkable.
c
sounds like a coredns bug then
b
Needless to say this is driving me absolutely nuts
Is there an alternative DNS service for RKE2?
c
for Kubernetes?
b
Well, for Kubernetes, yes, but specifically is there one that integrates nicely with RKE2
c
not that we bundle, no. coredns is kinda the de-facto dns service for Kubernetes. the kube-dns project is still maintained but I don’t ever see anyone using it.
it’s kinda old-school, it’s a wrapper around dnsmasq IIRC
never used it myself