# rke2
c
have you isolated it to anything in particular? lookups for a particular resource or tld, lookups between pods on different nodes, or so on?
b
Yes, I'm specifically trying to do lookups for node-X.zerotesting-service.zerotesting.svc.cluster.local
where X is from 0 to 999 (or however many are actively running in that service and alive and healthy)
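For reference, a single test lookup looks like this from any pod with basic DNS tooling (node-0 is just one example pod name from the set):
```
# From any pod with busybox nslookup or dig installed
nslookup node-0.zerotesting-service.zerotesting.svc.cluster.local

# or, if dig is available:
dig +short node-0.zerotesting-service.zerotesting.svc.cluster.local
```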
c
The most frequent cause of dropped traffic is something affecting CNI traffic between nodes. For example, DNS requests from node A fail when they hit a coredns pod on node C, but work fine when they hit a coredns pod on a different node.
can you scale coredns down to a single replica, and then test lookups from pods on different nodes?
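Something along these lines should do it — the deployment names below assume the stock rke2-coredns chart, and if the coredns autoscaler is running it will undo a manual scale, so stop that first:
```
# Names assume the stock rke2-coredns chart in kube-system; adjust if yours differ.
kubectl -n kube-system scale deployment rke2-coredns-rke2-coredns-autoscaler --replicas=0
kubectl -n kube-system scale deployment rke2-coredns-rke2-coredns --replicas=1

# then run the same lookup from pods that live on different nodes
kubectl exec <pod-on-node-A> -- nslookup node-0.zerotesting-service.zerotesting.svc.cluster.local
kubectl exec <pod-on-node-B> -- nslookup node-0.zerotesting-service.zerotesting.svc.cluster.local
```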
b
👍
c
also confirm that you’re not flooding coredns with more traffic than it can handle. By default coredns is limited to 100m CPU / 128Mi memory. If you are hitting it with a ton of traffic, you may need to raise those limits.
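You can sanity-check usage, and on RKE2 the clean way to raise the limits is a HelmChartConfig rather than editing the deployment (the helm controller will revert direct edits). Names, labels, and values keys below assume the stock chart, so treat it as a sketch:
```
# Check current usage vs. requests/limits (requires metrics-server; the label
# is what the stock rke2-coredns chart applies).
kubectl -n kube-system top pod -l k8s-app=kube-dns

# Raise the limits via a HelmChartConfig on a server node.
cat <<'EOF' > /var/lib/rancher/rke2/server/manifests/rke2-coredns-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
EOF
```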
b
Yep, it's a relatively modest load, probably about 1-10 qps
Oh
Also, it's not just dropped traffic @creamy-pencil-82913
weirdly it's actually NXDOMAIN-ing for lookups that it shouldn't
and this is basically bone stock RKE2 CoreDNS
The only config thing I changed was adding the "log" directive to the configmap (and the behaviour occurs even without that)
I've confirmed, using a static pod set to sleep infinity with /etc/resolv.conf edited to point at a particular CoreDNS pod directly, that these failures occur with all of the CoreDNS pods
regardless of host
It's always weirdly one in 5.5 to 6.5 queries. Always.
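Roughly how I'm measuring it, for reference — the nameserver IP below is a placeholder for whichever coredns pod the test pod gets pinned to:
```
# Throwaway pod pinned to one coredns pod via dnsConfig (IP is a placeholder),
# rather than editing /etc/resolv.conf by hand.
kubectl run dns-debug --image=busybox:1.36 --restart=Never \
  --overrides='{"spec":{"dnsPolicy":"None","dnsConfig":{"nameservers":["<coredns-pod-ip>"]}}}' \
  -- sleep infinity

# 100 identical lookups against that one pod; busybox nslookup exits non-zero
# when the name doesn't resolve, so just count failures.
kubectl exec dns-debug -- sh -c '
  ok=0; bad=0
  for i in $(seq 1 100); do
    if nslookup node-0.zerotesting-service.zerotesting.svc.cluster.local >/dev/null 2>&1; then
      ok=$((ok+1))
    else
      bad=$((bad+1))
    fi
  done
  echo "ok=$ok bad=$bad"'
```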
c
oh, hm. what kind of resource should that point at? a service, a pod, something else?
b
(On average).
It's a StatefulSet with 2600 pods in it, and a HeadlessService that points at the pods.
so that we can resolve pod-name.servicename
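The shape of the setup is basically this (labels and image below are placeholders, the names are the real ones); the DNS-relevant parts are the headless Service (clusterIP: None) and the StatefulSet's serviceName pointing at it, which is what creates the per-pod node-X.zerotesting-service records:
```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: zerotesting-service
  namespace: zerotesting
spec:
  clusterIP: None
  selector:
    app: zerotesting
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: node
  namespace: zerotesting
spec:
  serviceName: zerotesting-service
  replicas: 2600
  selector:
    matchLabels:
      app: zerotesting
  template:
    metadata:
      labels:
        app: zerotesting
    spec:
      containers:
      - name: app
        image: busybox:1.36
        command: ["sleep", "infinity"]
EOF
```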
c
is the coredns pod getting restarted? how is its resource utilization?
b
No, it's stable. Resource usage is low, and it's tweaked to have higher limits anyways
c
how old is the pod whose name you are trying to resolve? Is it just not resolving them immediately after creation?
b
Very old at this point, 30m+
It's currently a fairly stable statefulset of 2600 pods which hasn't been scaled up or down in at least 10 minutes, and most of the pods are 30m+ old
c
hm. You might edit the coredns deployment to turn up the coredns log level and see if it has any hints why it’s claiming nxdomain
and/or check at https://github.com/coredns/coredns/ and see if there are any similar reports
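One way to turn the logging up, assuming the stock chart's configmap and deployment names (manual configmap edits may eventually get reverted by the helm controller, but it's fine for a debug session):
```
# Add the debug plugin alongside log/errors in the Corefile, then bounce
# coredns so it picks the change up.
kubectl -n kube-system edit configmap rke2-coredns-rke2-coredns
#   .:53 {
#       debug     # verbose plugin logging
#       errors
#       log       # per-query logging (already added above)
#       ...
#   }
kubectl -n kube-system rollout restart deployment rke2-coredns-rke2-coredns
```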
b
(Also, JIC, I upgraded to the latest CoreDNS and am still seeing it)
c
hmm. do you see anything different if you bypass the coredns service and hit a single coredns pod by IP?
b
I did find this old issue a few days ago - https://github.com/coredns/coredns/issues/1365
And no, that's one of the things I've been doing to test it. Both service + coredns pod directly give the same results.
the consistency is almost remarkable.
c
sounds like a coredns bug then
b
Needless to say this is driving me absolutely nuts
Is there an alternative DNS service for RKE2?
c
for Kubernetes?
b
Well, for Kubernetes, yes, but specifically is there one that integrates nicely with RKE2
c
not that we bundle, no. coredns is kinda the de-facto dns service for Kubernetes. the kube-dns project is still maintained but I don’t ever see anyone using it.
it’s kinda old-school, it’s a wrapper around dnsmasq IIRC
never used it myself