
nutritious-oxygen-89191

04/04/2023, 6:47 AM
Hey all, I have issues with my DNS and am following [this](https://ranchermanager.docs.rancher.com/v2.5/troubleshooting/other-troubleshooting-tips/dns) troubleshooting guide. Below is the output of the loop that tests FQDN resolution. There are three manager nodes in my cluster and two of them cannot resolve google.com. Unfortunately the docs only say `had the UDP ports blocked`, but I opened all the ports mentioned [here](https://ranchermanager.docs.rancher.com/v2.5/getting-started/installation-and-upgrade/installation-requirements/port-requirements#downstr[…]er-nodes).
=> Start DNS resolve test
E0404 08:32:58.066969  105106 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 08:32:58.153873  105106 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 08:32:58.181868  105106 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 08:32:58.202611  105106 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 08:32:58.377407  105191 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 08:32:58.439348  105191 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 08:32:58.459940  105191 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
command terminated with exit code 1
XXX.XXX.XXX.XXX cannot resolve www.google.com
E0404 08:33:58.886092  107876 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 08:33:58.941918  107876 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 08:33:58.959819  107876 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 08:33:59.759025  107929 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 08:33:59.839680  107929 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0404 08:33:59.857799  107929 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
command terminated with exit code 1
YYY.YYY.YYY.YYY cannot resolve www.google.com
=> End DNS resolve test
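
The `memcache.go` lines in the output above are printed by kubectl itself when it cannot query the `metrics.k8s.io/v1beta1` API, so they are most likely noise rather than part of the DNS failure. A minimal sketch for filtering them out so that only the test results remain; the helper name `strip_discovery_noise` and the script name in the usage comment are made up for illustration:

```shell
# Hypothetical helper (not part of the Rancher guide): drop kubectl's
# API discovery errors, which all originate from memcache.go, and keep
# every other line unchanged.
strip_discovery_noise() {
  # grep exits non-zero when no lines remain; "|| true" keeps the
  # surrounding pipeline from being treated as failed.
  grep -v 'memcache\.go:' || true
}

# Usage (assumption: the DNS test loop was saved as dns-test.sh):
#   ./dns-test.sh 2>&1 | strip_discovery_noise
```

With the noise removed, the remaining lines are just the `cannot resolve` results per node.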

creamy-pencil-82913

04/04/2023, 7:18 AM
If you're talking about k3s, you should probably read the k3s docs: https://docs.k3s.io/installation/requirements#networking

nutritious-oxygen-89191

04/05/2023, 7:24 AM
Thanks, I went back to the k3s docs and resolved a couple of issues related to `firewalld` and `iptables`. Besides the DNS test mentioned above I also tried the [Overlay test](https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/networking); both still fail. Is there a k3s-specific troubleshooting guide that I can follow?

creamy-pencil-82913

04/05/2023, 9:58 AM
have you tried just disabling firewalld and any custom iptables drop/reject rules you have in place just to confirm that it’s not something in your FW config?
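
One way to spot such rules is to scan an `iptables -S` dump for DROP/REJECT targets and DROP chain policies. A rough sketch, assuming the input is in `iptables -S` format; the function name `list_blocking_rules` is invented for illustration:

```shell
# Hypothetical helper: read `iptables -S` output on stdin and print only
# rules that jump to DROP or REJECT, plus chains whose default policy is DROP.
list_blocking_rules() {
  # -e is required because the first pattern starts with a dash.
  grep -E -e '-j (DROP|REJECT)' -e '^-P [A-Za-z0-9_-]+ DROP' || true
}

# Usage on a node (needs root; stop firewalld first if it is running):
#   systemctl stop firewalld
#   iptables -S | list_blocking_rules
```

Kubernetes components may install DROP rules of their own (for example in `KUBE-` chains), so not every hit points at a misconfiguration.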

nutritious-oxygen-89191

04/05/2023, 11:26 AM
Yep. I even did `apt purge iptables` to make sure. ufw is inactive. It also seems there is an issue with the metrics-server, which fails with
panic: failed to create listener: failed to listen on 0.0.0.0:10250: listen tcp 0.0.0.0:10250: bind: address already in use
Something is wrong with my networking, but I am not sure where to start troubleshooting.

creamy-pencil-82913

04/05/2023, 3:13 PM
Hmm, that is odd. Metrics-server doesn't run with host network so there shouldn't be anything conflicting with that port

nutritious-oxygen-89191

04/06/2023, 7:09 AM
I resolved the issue with the metrics server. For some reason it was set to `hostNetwork: true`. The DNS and overlay network issue persists.
I tried to go through the `overlaytest` step by step, and if I run
kubectl --request-timeout='10s' exec overlaytest-dbn8m -c overlaytest -- /bin/sh -c "ping -c2 gi-rm1 > /dev/null 2>&1"
I get
Error from server: error dialing backend: x509: certificate is valid for 127.0.0.1, not 217.160.45.186
Independent of the overlay test, I created 4 pods (one on each node) based on the `jessie-dnsutils:1.3` image to run the ping command from there. My pods are called box0 etc.
kubectl --request-timeout='10s' exec -n whatwhatwhy box0-5675664f8d-ljwvf -- /bin/sh -c "ping -c2 gi-rm1"
Ping by hostname `gi-rm1` fails with
PING gi-rm1 (XXX.XXX.XXX.XXX) 56(84) bytes of data.

--- gi-rm1 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1009ms

command terminated with exit code
Ping by IP works for some nodes and fails for others. My guess is that name resolution does not work.
Correction: the three manager nodes are `gi-rm0`, `gi-rm1`, and `gi-rm2`. The worker is `geo-node1`.
Ping by IP among the 3 manager nodes works, ping by name does not. Ping from manager to worker node by IP fails, but works by name:
kubectl --request-timeout='10s' exec -n whatwhatwhy box0-5675664f8d-ljwvf -- /bin/sh -c "ping -c2 geo-node1"
Ping from worker to manager by name returns e.g. `unknown host gi-rm0`, and fails with `100% packet loss` when using the IP.
`nslookup` gives the following:
$ kubectl --request-timeout='10s' exec -n whatwhatwhy box0-5675664f8d-ljwvf -- /bin/sh -c "nslookup www.google.com"
Server:		10.43.0.10
Address:	10.43.0.10#53

Non-authoritative answer:
Name:	www.google.com
Address: 142.250.185.164
$ kubectl --request-timeout='10s' exec -n whatwhatwhy box1-78c59894bc-5cztq -- /bin/sh -c "nslookup www.google.com"
Server:		10.43.0.10
Address:	10.43.0.10#53

Non-authoritative answer:
Name:	www.google.com
Address: 142.250.185.164
$ kubectl --request-timeout='10s' exec -n whatwhatwhy box2-5d49b9d674-jv2pg -- /bin/sh -c "nslookup www.google.com"
Error from server: error dialing backend: x509: certificate is valid for 127.0.0.1, not XXX.XXX.XXX.XXX
$ kubectl --request-timeout='10s' exec -n whatwhatwhy geo-box1-74ccd9cfbd-wj5d5 -- /bin/sh -c "nslookup www.google.com"
;; connection timed out; no servers could be reached
;; connection timed out; no servers could be reached

command terminated with exit code 1
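
The four results above fall into three different outcomes, and it helps to tell them apart before digging further. The `x509` error is raised when the API server tries to reach the kubelet, so the lookup inside that pod never ran. The timeout means the pod could not reach the cluster DNS service at 10.43.0.10. A minimal sketch that sorts captured command output into those buckets; the function name `classify_lookup` and the bucket labels are invented for illustration:

```shell
# Hypothetical helper: read the output of one `kubectl exec ... nslookup`
# call on stdin and print a coarse label for it.
classify_lookup() {
  out=$(cat)
  case "$out" in
    # The API server could not open the exec connection to the kubelet.
    *"error dialing backend"*)       echo "exec-failed" ;;
    # nslookup ran, but no DNS server answered.
    *"no servers could be reached"*) echo "dns-unreachable" ;;
    # nslookup printed an answer section.
    *"Name:"*)                       echo "resolved" ;;
    *)                               echo "unknown" ;;
  esac
}

# Usage (pod name taken from the messages above):
#   kubectl exec -n whatwhatwhy box0-5675664f8d-ljwvf -- nslookup www.google.com 2>&1 | classify_lookup
```

Pods labelled `exec-failed` say nothing about DNS yet; the certificate mismatch has to be fixed before their lookups can be tested.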