# general
w
I have a curious thing happening in Rancher when I add a worker to an existing cluster. It looks like the DNS servers are not being configured correctly on the new VM. This is Rancher 2.11.2, Harvester 1.4.1, RKE2 cluster provisioned with openSUSE Micro VMs. When Rancher adds the new worker it can't get the CA certificate because DNS isn't resolving -
```
[  241.460390] cloud-init[2201]: [ERROR] 000 received while downloading the CA certificate. Sleeping for 5 seconds and trying again
[  246.469395] cloud-init[2201]: curl: (6) Could not resolve host: rancher.web
[  246.470034] cloud-init[2201]: [ERROR] 000 received while downloading the CA certificate. Sleeping for 5 seconds and trying again
[  251.479164] cloud-init[2201]: curl: (6) Could not resolve host: rancher.web
```
Now, we have two dedicated DNS servers in the building, one in the Harvester cluster as a VM and one bare metal, and the router advertises these as the DNS servers to use. I've tried editing the cloud-init to specify the DNS and that didn't help, so I reverted that. Currently I have 2 workers stuck in a create/destroy loop as they can't see the Rancher installation. The hardware nodes can all see the DNS fine and resolve, and the existing nodes in this target cluster can all see the DNS - but the newly provisioned nodes cannot? I'm at a bit of a loss at present to see what is causing this.
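For anyone hitting the same thing, these are the checks I'd run on one of the stuck VMs (serial console or SSH). rancher.web is the Rancher hostname from the log above; resolvectl only applies if systemd-resolved is actually in use on the image.

```sh
# On a stuck worker VM - what is the node actually using for DNS?
cat /etc/resolv.conf                  # the resolvers curl ends up with
resolvectl status                     # per-link DNS, if systemd-resolved is running
# Roughly the same fetch the provisioning script keeps retrying:
curl -vk https://rancher.web/cacerts
```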
We did recently add NeuVector to the stack - I have backed up, so I'm going to try removing it to see if it's interfering with the network.
That made no difference, will try creating a fresh cluster to see if that works.
From kubectl on the management cluster I can see that rancher.web resolves to an internal IP -
```
> curl -v rancher.web
* Host rancher.web:80 was resolved.
* IPv6: (none)
* IPv4: 10.0.38.5
*   Trying 10.0.38.5:80...
```
I've restarted coredns in the local management cluster, and restarted rke2-coredns; from the latter I can see I can ping the address, which resolves to the same internal IP. The new workers can't resolve the host, however. Tested by creating a fresh cluster - a single-node cluster (control plane/worker/etcd all on one node) - and that worked; scaling it by adding another node also fails. So I can create clusters - I just can't add to them!
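In case it helps anyone, this is roughly how I bounce CoreDNS - the deployment names below are the usual defaults, check `kubectl -n kube-system get deploy` if yours differ:

```sh
# Local/management cluster - the CoreDNS deployment is usually just "coredns":
kubectl -n kube-system rollout restart deploy/coredns
# Downstream RKE2 cluster - the chart usually names it rke2-coredns-rke2-coredns:
kubectl -n kube-system rollout restart deploy/rke2-coredns-rke2-coredns
kubectl -n kube-system rollout status deploy/rke2-coredns-rke2-coredns
```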
Wow - now things are starting to work; I think the coredns services may take some time to sort themselves out. I removed some of the problem nodes by choosing "scale down" on the specific node, though I'm not sure it does what it says - in that it'll choose another node to drop. Anyhow, new nodes are starting to work now. I added another pool to try that, and it does seem very unreliable though - two out of three in the new pool came up. Also a problem: it seems I can't see a way to stop those machines trying to provision, and scaling invariably means active nodes are taken out. Going through the DNS troubleshooting tips from the Rancher site now.
```
root@m1:/# kubectl -n kube-system get pods -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-546487b44f-xrsx2   1/1     Running   0          49m
root@m1:/# kubectl -n kube-system get svc -l k8s-app=kube-dns
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.43.0.10   <none>        53/UDP,53/TCP,9153/TCP   462d
root@m1:/# kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
Server:    10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.43.0.1 kubernetes.default.svc.cluster.local
pod "busybox" deleted
root@m1:/# kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup rancher.web
Server:    10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local

Name:      rancher.web
Address 1: 10.0.38.5
pod "busybox" deleted
```
So far so good (I think). Trying the DaemonSet, though the chart in the docs has a mistake in it - sleep: invalid number 'infinity'. The offending bit is quoted below, with one way to patch it after -
```
      - image: busybox:1.28
        imagePullPolicy: Always
        name: alpine
        command: ["sleep", "infinity"]
```
Infinity isn't something busybox's sleep understands - use a big finite number like 2147483647 instead. Now testing those DNS lookups -
```
root@m1:/# export DOMAIN=www.google.com; echo "=> Start DNS resolve test"; kubectl get pods -l name=dnstest --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do kubectl exec $pod -- /bin/sh -c "nslookup $DOMAIN > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $host cannot resolve $DOMAIN; fi; done; echo "=> End DNS resolve test"
=> Start DNS resolve test
=> End DNS resolve test
root@m1:/# export DOMAIN=rancher.web; echo "=> Start DNS resolve test"; kubectl get pods -l name=dnstest --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do kubectl exec $pod -- /bin/sh -c "nslookup $DOMAIN > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $host cannot resolve $DOMAIN; fi; done; echo "=> End DNS resolve test"
=> Start DNS resolve test
=> End DNS resolve test
```
So that all passes.
```
root@m1:/# kubectl run -i --restart=Never --rm test-${RANDOM} --image=ubuntu --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"dnsPolicy":"Default"}}' -- sh -c 'cat /etc/resolv.conf'
nameserver 192.168.122.185
nameserver 192.168.125.3
pod "test-30854" deleted
```
Bingo - 192.168.125.3 doesn't exist... digging into this further -
```
root@m1:/# resolvectl status
Global
           Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
    resolv.conf mode: stub
  Current DNS Server: 192.168.122.185
         DNS Servers: 192.168.122.185
Fallback DNS Servers: 192.168.122.124

Link 2 (eno1)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.125.3
       DNS Servers: 192.168.125.3
```
So these are coming from Link 2? I think at some point in the distant past I had an LB set up on this IP pointing to a DNS service VM; for now I've recreated the LB and pointed it at a valid DNS server - but I'll need to dig into where this value was / is coming from... The router doesn't list it... resolvectl on the Harvester nodes does not list 125.3... Some fun digging into this; anyway, things are starting to look more healthy at the moment but I will keep digging to put a finer point on this.
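For the record, these are the places I plan to check for where a per-link DNS server like that can come from - which ones apply depends on whether the node uses systemd-networkd, NetworkManager or static config files (eno1 is the link from the output above):

```sh
# Where is eno1 getting 192.168.125.3 from?
resolvectl dns eno1                    # what systemd-resolved currently has for the link
networkctl status eno1                 # if systemd-networkd manages it, shows DHCP-offered DNS
nmcli device show eno1 | grep -i dns   # if NetworkManager manages it
# any static leftovers mentioning that address:
grep -ri "192.168.125.3" /etc/sysconfig/network/ /etc/systemd/network/ /etc/netplan/ 2>/dev/null
```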
a
Happy to see you making progress on this @worried-state-78253 - I've been following along. Trying to resolve a bit of my own issues in a local dev env without much luck.
w
@adamant-traffic-5372 This is a full-size cluster we're working on, with 11 servers in total... This probably has no real relation to local dev. We've used docker-desktop and rancher-desktop for local dev - I recommend getting in the habit of using npm to script build/up/down jobs so you can keep your local environment clean. Using host mounts means it doesn't matter if you delete your charts, of course - that's how we work locally and switch projects. Use nginx ingress and no-ip to loop back to 127.0.0.1 and you're golden 😉
a
Appreciate that @worried-state-78253. I think it does have something to do with ingress on the downstream k3d cluster. Just haven't pinpointed it yet.
w
If it's local, the ingress is pretty easy with nginx; if you're stuck, just forward the port to the service!
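Something like this is usually enough for local dev (namespace, service name and ports are placeholders):

```sh
# Skip ingress entirely and forward a local port straight to the service:
kubectl -n <namespace> port-forward svc/<service-name> 8080:80
# then hit http://localhost:8080
```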
My problems appear fixed at the moment, but it's a hack and really I need to dig into systemd-resolved and find out where the strange address came from...
a
Yeah; I have found that DNS seems to resolve differently in various circumstances. Have you checked the CoreDNS ConfigMap? Is it defined in there?
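e.g. something like this - the ConfigMap name differs by distro (usually coredns, or rke2-coredns-rke2-coredns on RKE2):

```sh
# Find the CoreDNS ConfigMap and look at the Corefile / forwarders in it:
kubectl -n kube-system get configmaps | grep -i coredns
kubectl -n kube-system get configmap coredns -o yaml
```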
w
It's not that - it's further up the chain. It is getting the DNS from the system, which is also wrong in part; the resolver is getting a record from somewhere - you can see this in my debug output - the 125.3 address was not in service and comes from somewhere... I need to look up the chain, but I've had enough for today, and setting up the temporary fix gets my nodes provisioning again.
a
Good luck Craig.
w
I believe I found the fix for my issue - on the Rancher management cluster nodes (bare metal) the wrong DNS was set for the main interface, despite resolvectl showing the default DNS servers as correct. I had to manually update the DNS for that specific interface on each of those hardware nodes:
```
root@m1:~# resolvectl dns <INTERFACE> <YOUR-IPS>
root@m1:~# resolvectl status <INTERFACE>
Link 2 (<INTERFACE>)
Current Scopes: DNS
     Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
   DNS Servers: <YOUR-IPS>
```
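One caveat: `resolvectl dns` only changes the runtime state, so the interface config itself still needs fixing to survive a reboot. A sketch of how that could look with NetworkManager - the connection name is a placeholder and the IPs are the two servers from my resolvectl output above; adjust if the nodes use systemd-networkd or wicked instead:

```sh
# Persist the DNS servers on the connection that owns the interface
# (find the name with `nmcli con show`); IPs are assumptions from above.
nmcli con mod "<connection-name>" ipv4.dns "192.168.122.185 192.168.122.124" ipv4.ignore-auto-dns yes
nmcli con up "<connection-name>"
```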
a
Nice find. That seems to coincide with what I saw as well in my limited experience. At some point the order of precedence does fall back to the machine's DNS definition, and in your case even to the interface's definition.