# rke2
n
Hello, I'd be grateful for input on what might be wrong with creating a new RKE2 cluster on my first bare-metal node from within Rancher 2.9, or for hints on how to troubleshoot further. Issue: I create a new cluster from the Rancher UI and execute the registration command on the first node. RKE2 gets installed, but the cluster agent never connects back to Rancher. The Rancher provisioning log 'ends' with:
```
(...)
[INFO ] configuring bootstrap node(s) custom-309c46378d42: waiting for cluster agent to connect
```
Checking `/var/lib/rancher/rke2/agent/logs/kubelet.log` on the node, I see:
```
(...)
kubelet_node_status.go:73] "Attempting to register node" node="mynode01"
kubelet_node_status.go:96] "Unable to register node with API server" err="Post \"https://127.0.0.1:6443/api/v1/nodes\": dial tcp 127.0.0.1:6443: connect: connection refused" node="mynode01"
(...)
"Failed to ensure lease exists, will retry" err="Get \"https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/mynode01?timeout=10s\": dial tcp 127.0.0.1:6443: connect: connection refused" interval="6.4s"
(...)
"Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-58cc5fbb8d-mrdvf_cattle-system(6bea22f6-c306-4885-bf9c-9ab485cc8e47)\"" pod="cattle-system/cattle-cluster-agent-58cc5fbb8d-mrdvf" podUID="6bea22f6-c306-4885-bf9c-9ab485cc8e47"
(...)
```
Environment info: Rancher 2.9. Cluster configuration:
```yaml
spec:
  kubernetesVersion: v1.29.15+rke2r1
  localClusterAuthEndpoint: {}
  rkeConfig:
    (...)
    machineGlobalConfig:
      cluster-cidr: 100.194.0.0/16
      cluster-dns: 100.195.0.10
      cluster-domain: cluster.local
      cni: cilium
      disable-kube-proxy: false
      etcd-expose-metrics: false
      service-cidr: 100.195.0.0/16
      tls-san: []
      (...)
```
c
It is normal for the kubelet to be unable to reach the apiserver when it first starts, as the kubelet itself runs the apiserver as a static pod. You'll need to provide more information, or look through the logs to find a more relevant error.
In general, I would look in `/var/log/pods` to confirm that etcd, kube-apiserver, kube-scheduler, and so on are running without errors. Also check the output of `kubectl get pod -A -o wide` to see what all is running.
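For example (a rough sketch; the directory names under `/var/log/pods` embed each pod's UID, so list them or use a glob):

```bash
# Static pod logs live under /var/log/pods, one directory per pod
ls /var/log/pods/
# Tail the apiserver's log (directories are named <namespace>_<pod>_<uid>)
tail -f /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log

# RKE2 ships its own kubectl and kubeconfig on server nodes
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get pod -A -o wide
```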
p
Could you also check the status of rke2-server.service and rancher-system-agent.service? Are there any errors in there?
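E.g.:

```bash
systemctl status rke2-server.service rancher-system-agent.service
# Follow either unit's logs while the agent tries to register
journalctl -u rke2-server -f
journalctl -u rancher-system-agent -f
```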
n
Thanks for your answers. No errors in either service. As of now I think the issue must lie within DNS. `kubectl logs` of my cattle-cluster-agent shows an error (it hangs in CrashLoopBackOff as well):
```
#dnsPolicy: ClusterFirst
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local mydomain.com nameserver 10.43.0.10 options ndots:5
ERROR: https://myrancher.mydomain.com/ping is not accessible (Could not resolve host: myrancher.mydomain.com)

#dnsPolicy: Default
INFO: Using resolv.conf: search mydomain.com nameserver 172.1.1.1 nameserver 172.1.1.2
ERROR: https://myrancher.mydomain.com/ping is not accessible (Could not resolve host: myrancher.mydomain.com)
```
Resolving myrancher.mydomain.com from another pod in kube-system works. I found a few open/known issues related to DNS, where #16454 comes closest to what I seem to have. Following that, I changed `/etc/hosts` and modified the CoreDNS configmap, in both cases adding myrancher.mydomain.com as a host (see the sketch below). I also tried different configurations when creating the cluster (Cilium and Calico as CNI, leaving all network settings at their defaults). No change in behaviour so far.
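Roughly what I added (the IP here is a placeholder for the actual Rancher server address; the configmap name matches the rke2-coredns chart deployment shown below):

```bash
# On the node: pin the Rancher hostname in /etc/hosts
echo "172.1.1.50 myrancher.mydomain.com" >> /etc/hosts   # placeholder IP

# In the cluster: add a hosts block to the CoreDNS Corefile
kubectl -n kube-system edit configmap rke2-coredns-rke2-coredns
#   hosts {
#     172.1.1.50 myrancher.mydomain.com   # placeholder IP
#     fallthrough
#   }
```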
State of the pods in kube-system:
```
NAME                                                   READY   STATUS             RESTARTS         AGE
cilium-g8vdw                                           1/1     Running            0                61m
cilium-operator-6dfff79f58-4fxqx                       0/1     Pending            0                61m
cilium-operator-6dfff79f58-n562h                       1/1     Running            0                61m
cloud-controller-manager-rzlxdefdevrke01               1/1     Running            0                61m
etcd-rzlxdefdevrke01                                   1/1     Running            0                60m
helm-install-rke2-cilium-mqdht                         0/1     Completed          0                61m
helm-install-rke2-coredns-vp7lh                        0/1     Completed          0                61m
helm-install-rke2-ingress-nginx-mcm6s                  0/1     CrashLoopBackOff   13 (4m48s ago)   61m
helm-install-rke2-metrics-server-llsgm                 1/1     Running            14 (5m10s ago)   61m
helm-install-rke2-runtimeclasses-hwjzd                 1/1     Running            14 (5m25s ago)   61m
helm-install-rke2-snapshot-controller-crd-smxzk        0/1     CrashLoopBackOff   13 (5m ago)      61m
helm-install-rke2-snapshot-controller-pzsk6            0/1     CrashLoopBackOff   13 (4m57s ago)   61m
kube-apiserver-rzlxdefdevrke01                         1/1     Running            0                61m
kube-controller-manager-rzlxdefdevrke01                1/1     Running            0                61m
kube-proxy-rzlxdefdevrke01                             1/1     Running            0                60m
kube-scheduler-rzlxdefdevrke01                         1/1     Running            0                61m
rke2-coredns-rke2-coredns-6ff85b696b-d82zg             0/1     Running            0                72s
rke2-coredns-rke2-coredns-autoscaler-6898c795f-884wh   1/1     Running            18 (5m36s ago)   61m
```
p
CoreDNS being dead may not help the cluster agent resolve hosts.
n
It says:
```
maxprocs: Updating GOMAXPROCS=1: using minimum allowed GOMAXPROCS
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration SHA512 = c18591e7950724fe7f26bd172b7e98b6d72581b4a8fc4e5fc4cfd08229eea58f4ad043c9fd3dbd1110a11499c4aa3164cdd63ca0dd5ee59651d61756c4f671b7
CoreDNS-1.12.0
linux/amd64, go1.22.8 X:boringcrypto, 51e11f166
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.31.2/tools/cache/reflector.go:243: failed to list *v1.Namespace: Get "https://100.195.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 100.195.0.1:443: i/o timeout
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.31.2/tools/cache/reflector.go:243: failed to list *v1.Service: Get "https://100.195.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 100.195.0.1:443: i/o timeout
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.31.2/tools/cache/reflector.go:243: failed to list *v1.EndpointSlice: Get "https://100.195.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 100.195.0.1:443: i/o timeout
[ERROR] plugin/kubernetes: Unhandled Error
```
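That 100.195.0.1 is the first address of my service-cidr (100.195.0.0/16), i.e. the ClusterIP of the kubernetes Service, so CoreDNS apparently can't reach the apiserver over the service network. A quick sanity check from the node could look like this (a sketch, assuming kube-proxy in its default iptables mode):

```bash
# The kubernetes Service should map to the apiserver's node IP:6443
kubectl get svc,endpoints kubernetes
# Does the ClusterIP answer at all from the node?
curl -vk https://100.195.0.1:443/version --max-time 5
# Are kube-proxy's rules for it present?
iptables-save | grep 100.195.0.1 | head
```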
p
Seems there's a CNI issue somewhere, good luck 😮
c
You disabled kube-proxy. How are you expecting to reach Service ClusterIP addresses without kube-proxy?
p
Did he? It says: `disable-kube-proxy: false`
c
Ah right, sorry, I was on mobile and the wrapping makes it hard to read. `false` is the default, so usually there is no reason to set it if it's false. Since this is a single-node cluster, things should be pretty simple. Is there perhaps a local firewall on this node that needs to be disabled?
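E.g., on a typical systemd distro (a sketch; firewalld is known to conflict with RKE2's CNI rules):

```bash
# Check for and stop a host firewall
systemctl is-active firewalld && sudo systemctl disable --now firewalld
# Look for stray packet-filter rules that could drop pod/service traffic
sudo iptables -S | head
sudo nft list ruleset | head
```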
n
Thanks to both of you for your input. I'll be able to look more deeply into it next week.