#rke2

stale-painting-80203

03/14/2023, 12:38 AM
Rancher v2.7.0 with a downstream RKE2 cluster. I'm not able to import an orphaned cluster into a new instance of Rancher (Import Existing -> Import any Kubernetes cluster). After issuing the import command on the cluster, several pods go into CrashLoop and do not recover:
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml apply -f https://rancher75182.senode.dev/v3/import/xhctfcnbbt56xvxh6jptq7lzvpw9svd2drkbj5pvm466t5r7zlplqv_c-m-zqcvzlgn.yaml
clusterrole.rbac.authorization.k8s.io/proxy-clusterrole-kubeapiserver unchanged
clusterrolebinding.rbac.authorization.k8s.io/proxy-role-binding-kubernetes-master unchanged
namespace/cattle-system unchanged
serviceaccount/cattle unchanged
clusterrolebinding.rbac.authorization.k8s.io/cattle-admin-binding unchanged
secret/cattle-credentials-ad9a794 created
clusterrole.rbac.authorization.k8s.io/cattle-admin unchanged
Warning: spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key: beta.kubernetes.io/os is deprecated since v1.14; use "kubernetes.io/os" instead
deployment.apps/cattle-cluster-agent configured
service/cattle-cluster-agent unchanged

NAMESPACE             NAME                                                    READY   STATUS             RESTARTS      AGE
calico-system         calico-kube-controllers-f75c97ff6-fvb66                 1/1     Running            0             19m
calico-system         calico-node-6vxmh                                       1/1     Running            0             19m
calico-system         calico-node-d9t8n                                       0/1     Running            0             17m
calico-system         calico-node-khhpr                                       1/1     Running            0             19m
calico-system         calico-node-nmcds                                       0/1     Running            0             17m
calico-system         calico-typha-d65458ffc-97pn9                            1/1     Running            0             17m
calico-system         calico-typha-d65458ffc-p9cj2                            1/1     Running            0             19m
cattle-fleet-system   fleet-agent-6c857b85b5-zff2l                            1/1     Running            0             17m
cattle-system         cattle-cluster-agent-6f588568-dj7ql                     0/1     CrashLoopBackOff   4 (49s ago)   4m9s
cattle-system         cattle-cluster-agent-6f588568-zl55k                     0/1     CrashLoopBackOff   4 (29s ago)   3m53s
kube-system           cloud-controller-manager-sempre1-ctrl                   1/1     Running            0             20m
kube-system           cloud-controller-manager-sempre1-etcd                   1/1     Running            0             20m
kube-system           etcd-sempre1-etcd                                       1/1     Running            0             19m
kube-system           helm-install-rke2-calico-7dxlb                          0/1     Completed          2             20m
kube-system           helm-install-rke2-calico-crd-wzffm                      0/1     Completed          0             20m
kube-system           helm-install-rke2-coredns-zs9rl                         0/1     Completed          0             20m
kube-system           helm-install-rke2-ingress-nginx-gtkv8                   0/1     CrashLoopBackOff   6 (40s ago)   20m
kube-system           helm-install-rke2-metrics-server-blcf4                  0/1     CrashLoopBackOff   6 (51s ago)   20m
kube-system           kube-apiserver-sempre1-ctrl                             1/1     Running            0             20m
kube-system           kube-controller-manager-sempre1-ctrl                    1/1     Running            0             20m
kube-system           kube-proxy-sempre1-ctrl                                 1/1     Running            0             20m
kube-system           kube-proxy-sempre1-etcd                                 1/1     Running            0             20m
kube-system           kube-proxy-sempre1-wrk1                                 1/1     Running            0             17m
kube-system           kube-proxy-sempre1-wrk2                                 1/1     Running            0             17m
kube-system           kube-scheduler-sempre1-ctrl                             1/1     Running            0             20m
kube-system           rke2-coredns-rke2-coredns-58fd75f64b-kfb69              1/1     Running            0             19m
kube-system           rke2-coredns-rke2-coredns-58fd75f64b-rzpsg              1/1     Running            0             20m
kube-system           rke2-coredns-rke2-coredns-autoscaler-768bfc5985-hcf4b   1/1     Running            0             20m
tigera-operator       tigera-operator-586758ccf7-rc9tq                        1/1     Running            0             19m

Looking at the logs, it seems the cluster agent is unable to ping the Rancher server, but if I do a curl on the same URL it responds with a pong.
ERROR: https://rancher75182.senode.dev/ping is not accessible (Could not resolve host: rancher75182.senode.dev)
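(For reference, a sketch of how those agent logs can be pulled; the pod name is taken from the listing above and will differ on another cluster, and --previous shows the output of the last crashed container:)
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml logs -n cattle-system cattle-cluster-agent-6f588568-dj7ql --previous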

The helm pods report errors as well:
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml logs helm-install-rke2-ingress-nginx-gtkv8 -n cattle-system
Error from server (NotFound): pods "helm-install-rke2-ingress-nginx-gtkv8" not found
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml logs helm-install-rke2-metrics-server-blcf4 -n cattle-system
Error from server (NotFound): pods "helm-install-rke2-metrics-server-blcf4" not found

creamy-pencil-82913

03/14/2023, 12:39 AM
This sounds like an issue with Rancher, not with RKE2?
That said…
ERROR: https://rancher75182.senode.dev/ping is not accessible (Could not resolve host: rancher75182.senode.dev)
this hostname suggests that you are using a private DNS zone for your Rancher server. Can you confirm that the resolv.conf file on your nodes is properly configured to point at that? Check the RKE2 logs for a message about using 8.8.8.8 instead of your private DNS server.
Host resolv.conf includes loopback or multicast nameservers - kubelet will use autogenerated resolv.conf with nameserver 8.8.8.8
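(A quick way to check both, assuming RKE2 runs under the usual systemd units so its logs land in journald; adjust the unit names if yours differ:)
cat /etc/resolv.conf
journalctl -u rke2-server -u rke2-agent | grep -i resolv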

stale-painting-80203

03/14/2023, 12:42 AM
yes. I can ping the rancher server from all nodes on the downstream cluster

creamy-pencil-82913

03/14/2023, 12:42 AM
^^ it'll look like that
make sure that your resolv.conf doesn’t point at multicast or loopback resolvers
If you can’t resolve private hostnames from within pods, RKE2 falling back to 8.8.8.8 is the most likely reason why
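(One way to test in-pod resolution directly is a throwaway pod; the busybox image, pod name, and kubeconfig path here are just placeholders:)
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml run dns-test --rm -it --image=busybox:1.36 --restart=Never -- nslookup rancher75182.senode.dev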

stale-painting-80203

03/14/2023, 12:46 AM
mine only has the nameserver pointing to an IP address
everything else is commented out
Even rebooting the cluster VMs on a working cluster causes the same issue, and the cluster never comes back up again. The same pods go into a CrashLoop

creamy-pencil-82913

03/14/2023, 3:22 AM
Is the name server not reachable from within pods? What do the coredns pod logs say?
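(For example, using one of the coredns pod names from the earlier listing; names will differ on a rebuilt cluster:)
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml logs -n kube-system rke2-coredns-rke2-coredns-58fd75f64b-kfb69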

stale-painting-80203

03/14/2023, 4:03 AM
The coredns pods have the right nameserver and access works from there. I'm unable to exec into the pods that are in a crashloop
It seems bizarre that a working cluster is unable to establish a connection with Rancher after the cluster VMs were rebooted

creamy-pencil-82913

03/14/2023, 4:54 AM
So just to confirm, the coredns pods can resolve the rancher server address but the rancher agent pods cannot? How did you test that it can be resolved by coredns?

stale-painting-80203

03/14/2023, 5:04 AM
I recreated a fresh cluster and collected more data. It seems that on cluster creation the cluster-agent pods come up successfully even though they are unable to ping Rancher, and the cluster still connects to Rancher. One pod has an error on the certificate and the other cannot reach it. After rebooting the cluster, the cluster-agent pods go into a crashloop
also from my local machine or VMs, I don't see any issues with rancher ping:
curl https://rancher75182.senode.dev/ping
pong
Also, to answer your question regarding coredns: yes, I exec into coredns and curl google.com or my Rancher instance and it works, but not from within the cluster-agents
The coredns pod IPs are 10.42.247.129 and 10.42.213.131, but the cluster-agent resolv.conf has nameserver 10.43.0.10. Is that correct?

creamy-pencil-82913

03/14/2023, 9:01 PM
yes, that is the dns service address. Can you show me how specifically you are testing resolution from within the coredns pods?
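(A way to test that service address directly from a node, assuming dig is installed there; 10.43.0.10 is the cluster DNS service mentioned above:)
dig @10.43.0.10 rancher75182.senode.dev
dig @10.43.0.10 kubernetes.default.svc.cluster.local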

stale-painting-80203

03/14/2023, 9:02 PM
Just doing a curl as shown above.

creamy-pencil-82913

03/14/2023, 9:03 PM
You said you were doing that from the local machine or VM, not inside the coredns pod

stale-painting-80203

03/14/2023, 9:05 PM
Both. I also created a new cluster on the same server as Rancher and it seems to work fine; I am able to restart that cluster and it comes back up. I suspect it's some networking issue on the Harvester server where the cluster fails to come up after a restart

creamy-pencil-82913

03/14/2023, 9:07 PM
hmm