worried-state-78253
06/24/2025, 11:32 AM
[  241.460390] cloud-init[2201]: [ERROR] 000 received while downloading the CA certificate. Sleeping for 5 seconds and trying again
[  246.469395] cloud-init[2201]: curl: (6) Could not resolve host: rancher.web
[  246.470034] cloud-init[2201]: [ERROR] 000 received while downloading the CA certificate. Sleeping for 5 seconds and trying again
[  251.479164] cloud-init[2201]: curl: (6) Could not resolve host: rancher.web
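For reference, the step failing here is the agent's CA certificate download from the Rancher server; it can be reproduced by hand from a failing node (assuming the standard /cacerts endpoint and that https://rancher.web is the server URL):
# From the newly provisioned node, check resolution first, then try the same fetch the install script does:
nslookup rancher.web
curl -kfv https://rancher.web/cacerts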
Now we have two dedicated DNS servers in the building, one running as a VM in the Harvester cluster and one on bare metal, and the router advertises these as the DNS servers to use.
I've tried editing the cloud-init to specify the DNS servers explicitly; that didn't help, so I reverted it.
Currently I have two workers stuck in a create/destroy loop because they can't see the Rancher installation.
The hardware nodes can all see the DNS and resolve fine, and the existing nodes in this target cluster can all see the DNS - but the newly provisioned nodes cannot. I'm at a bit of a loss at present to see what is causing this.
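A straightforward comparison (nothing Rancher-specific) is to look at what resolver config an existing node ends up with versus one of the looping workers:
# Run on a working node and on a newly provisioned worker, then compare:
cat /etc/resolv.conf
resolvectl status
nslookup rancher.web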
worried-state-78253
06/24/2025, 12:50 PM
worried-state-78253
06/24/2025, 1:01 PM
worried-state-78253
06/24/2025, 1:35 PM
> curl -v rancher.web
* Host rancher.web:80 was resolved.
* IPv6: (none)
* IPv4: 10.0.38.5
* Trying 10.0.38.5:80...
I've restarted coredns in the local management cluster, and restarted rke2-coredns; from the latter I can ping the address and it resolves to the same internal IP.
The new workers still can't resolve the host, however.
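Deployment names vary by distribution, but the restarts were along these lines (rke2-coredns-rke2-coredns is the usual chart-managed name on RKE2; adjust to whatever kubectl -n kube-system get deploy shows):
# Restart CoreDNS in the management cluster (assuming a plain "coredns" deployment):
kubectl -n kube-system rollout restart deployment coredns
# Restart the RKE2 CoreDNS and wait for it to come back:
kubectl -n kube-system rollout restart deployment rke2-coredns-rke2-coredns
kubectl -n kube-system rollout status deployment rke2-coredns-rke2-coredns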
Tested creating a fresh cluster - a single-node cwe (controlplane/etcd/worker) cluster worked, but scaling it by adding another node also fails.
So I can create clusters - just can't add to them!
worried-state-78253
06/24/2025, 2:08 PM
worried-state-78253
06/24/2025, 3:12 PM
root@m1:/# kubectl -n kube-system get pods -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-546487b44f-xrsx2 1/1 Running 0 49m
root@m1:/# kubectl -n kube-system get svc -l k8s-app=kube-dns
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.43.0.10 <none> 53/UDP,53/TCP,9153/TCP 462d
root@m1:/# kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
Server: 10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.43.0.1 kubernetes.default.svc.cluster.local
pod "busybox" deleted
root@m1:/# kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup rancher.web
Server: 10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
Name: rancher.web
Address 1: 10.0.38.5
pod "busybox" deleted
So far so good (I think).
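Since the in-cluster lookups work, the next thing worth checking is what CoreDNS forwards external names like rancher.web to - by default the Corefile forwards to the node's /etc/resolv.conf. Configmap names vary by distribution, so roughly:
# Find the CoreDNS configmap and inspect the forward directive in the Corefile:
kubectl -n kube-system get configmap | grep -i coredns
kubectl -n kube-system get configmap coredns -o yaml | grep -B1 -A2 forward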
Trying the daemonset, though the manifest in the docs has a mistake in it -
- image: busybox:1.28
  imagePullPolicy: Always
  name: alpine
  command: ["sleep", "infinity"]
The pods fail with:
sleep: invalid number 'infinity'
"infinity" isn't something busybox's sleep understands - use a large number like 2147483647 instead.
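Assuming the daemonset is named dnstest (matching the name=dnstest label used in the checks below), it can be patched in place rather than editing the YAML:
# Swap "sleep infinity" for a large finite value busybox's sleep accepts, then wait for the pods:
kubectl patch daemonset dnstest --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/command","value":["sleep","2147483647"]}]'
kubectl rollout status daemonset dnstest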
Now testing those DNS lookups -
root@m1:/# export DOMAIN=www.google.com; echo "=> Start DNS resolve test"; kubectl get pods -l name=dnstest --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do kubectl exec $pod -- /bin/sh -c "nslookup $DOMAIN > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $host cannot resolve $DOMAIN; fi; done; echo "=> End DNS resolve test"
=> Start DNS resolve test
=> End DNS resolve test
root@m1:/# export DOMAIN=rancher.web; echo "=> Start DNS resolve test"; kubectl get pods -l name=dnstest --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do kubectl exec $pod -- /bin/sh -c "nslookup $DOMAIN > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $host cannot resolve $DOMAIN; fi; done; echo "=> End DNS resolve test"
=> Start DNS resolve test
=> End DNS resolve test
So that all passes.
root@m1:/# kubectl run -i --restart=Never --rm test-${RANDOM} --image=ubuntu --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"dnsPolicy":"Default"}}' -- sh -c 'cat /etc/resolv.conf'
nameserver 192.168.122.185
nameserver 192.168.125.3
pod "test-30854" deleted
Bingo - 192.168.125.3 doesn't exist... digging into this further -
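A quick sanity check that 192.168.125.3 really is dead rather than just slow (dig is from dnsutils/bind-utils; the options just shorten the timeout):
# Query each nameserver from the node's resolv.conf directly; the stale one should time out:
dig @192.168.122.185 rancher.web +short +time=2 +tries=1
dig @192.168.125.3 rancher.web +short +time=2 +tries=1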
root@m1:/# resolvectl status
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 192.168.122.185
DNS Servers: 192.168.122.185
Fallback DNS Servers: 192.168.122.124
Link 2 (eno1)
Current Scopes: DNS
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.125.3
DNS Servers: 192.168.125.3
So these are coming from Link 2?
I think at some point in the distant past I had an LB set up on this IP pointing to a DNS service VM. For now I've recreated the LB and pointed it at a valid DNS server, but I'll need to dig into where this value was / is coming from...
The router doesn't list them...
resolvectl on the Harvester nodes doesn't list 125.3 either....
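If the 125.3 entry is per-link, it most likely arrived via DHCP or a static systemd-networkd/netplan config on eno1 (assuming the node uses systemd-networkd; under NetworkManager it would be nmcli device show eno1 instead):
# Shows the DHCP lease details and any DNS servers pushed on that link:
networkctl status eno1
# And check for a statically configured DNS= / nameservers entry:
grep -ri "192.168.125.3" /etc/systemd/network/ /etc/netplan/ 2>/dev/null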
Some fun digging into this, anyway. Things are starting to look more healthy at the moment, but I'll keep digging to get a finer point on this.
adamant-traffic-5372
06/24/2025, 3:24 PM
worried-state-78253
06/24/2025, 3:29 PM
adamant-traffic-5372
06/24/2025, 3:34 PM
worried-state-78253
06/24/2025, 3:46 PM
worried-state-78253
06/24/2025, 4:20 PM
adamant-traffic-5372
06/24/2025, 4:23 PM
worried-state-78253
06/24/2025, 4:25 PM
adamant-traffic-5372
06/24/2025, 4:29 PM
worried-state-78253
06/25/2025, 9:42 AM
root@m1:~# resolvectl dns <INTERFACE> <YOUR-IPS>
root@m1:~# resolvectl status <INTERFACE>
Link 2 (<INTERFACE>)
Current Scopes: DNS
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
DNS Servers: <YOUR-IPS>
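One caveat: resolvectl dns only changes the runtime state, so it won't survive a reboot or a lease renewal. To drop the per-link override entirely instead of replacing it:
# Clear any per-link DNS/domain overrides and fall back to the global servers:
resolvectl revert <INTERFACE>
# For a persistent change, fix the source instead - the link's netplan/systemd-networkd
# config, or whatever DHCP server is handing out the stale address.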
adamant-traffic-5372
06/25/2025, 11:20 AM