quiet-potato-9276

05/12/2023, 10:47 AM
Hello Team, I am trying to set up RKE on 4 CentOS Linux release 7.9.2009 (Core) VMs with 4 cores and 8 GB RAM each. The configuration is 1 master node and 3 worker nodes. rke up ran fine without errors, but I have issues with CoreDNS not running, which I think is caused by calico-kube-controllers being in CrashLoopBackOff. The error from Calico is that it cannot reach https://10.43.0.1:443/apis, with a "no route to host" error, even though I can curl that IP from all my nodes (including the master). I thought it might have been SELinux, so I disabled that, but the issue remains. I've attached my cluster.yaml and some logs.
Canal
CoreDNS
Calico
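(A minimal sketch of how the crashing pod could be inspected; the pod name below is a placeholder to be replaced with the actual name from kubectl get pods:)
kubectl -n kube-system get pods -o wide                                # find the calico-kube-controllers pod and the node it runs on
kubectl -n kube-system describe pod <calico-kube-controllers-pod>     # the Events section often explains the restarts
kubectl -n kube-system logs <calico-kube-controllers-pod> --previous  # logs from the previous, crashed container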

great-jewelry-76121

05/12/2023, 10:59 AM
I have issues with coredns not running which I think is caused by calico-kube-controllers in crash loop back off.
It won't be that: calico-kube-controllers doesn't do anything that affects the dataplane for other pods. It's essentially a garbage collector and label updater. Can you do a quick check for me: can pods talk to each other? On the same node? On different nodes?
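(A minimal sketch of such a check, using two throwaway busybox pods; pod names and IPs are placeholders:)
kubectl run busybox1 --image=busybox --restart=Never -- sleep 3600
kubectl run busybox2 --image=busybox --restart=Never -- sleep 3600
kubectl get pods -o wide                                     # confirm the two pods landed on different nodes and note their IPs
kubectl exec -ti busybox1 -- ping -c 2 <pod-IP-of-busybox2>  # repeat with pods on the same node for comparison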

quiet-potato-9276

05/12/2023, 11:05 AM
I can ping between busybox1 and busybox2 on different nodes:
[al@rkemaster01 ~]$ kubectl exec -ti busybox2 -- /bin/sh
E0512 12:03:20.136051   25714 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0512 12:03:20.138957   25714 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0512 12:03:20.143424   25714 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
/ #
/ # ping 10.42.3.7
PING 10.42.3.7 (10.42.3.7): 56 data bytes
64 bytes from 10.42.3.7: seq=0 ttl=62 time=1.524 ms
64 bytes from 10.42.3.7: seq=1 ttl=62 time=0.766 ms
^C
--- 10.42.3.7 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.766/1.145/1.524 ms
I had already started upgrading the kernel on CentOS to 5.x.
I've now upgraded the kernel, and although the warning has gone, there is no change:
[al@rkemaster01 ~]$ kubectl get pods -A
E0512 12:23:31.447388    4839 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0512 12:23:31.473213    4839 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0512 12:23:31.477644    4839 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0512 12:23:31.481753    4839 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAMESPACE       NAME                                      READY   STATUS             RESTARTS         AGE
default         busybox1                                  1/1     Running            2 (119s ago)     22m
default         busybox2                                  1/1     Running            1 (2m24s ago)    22m
default         nginx                                     1/1     Running            3 (104s ago)     72m
ingress-nginx   ingress-nginx-admission-create-p8z4t      0/1     Completed          0                73m
ingress-nginx   nginx-ingress-controller-kqczp            0/1     Running            30 (1s ago)      73m
ingress-nginx   nginx-ingress-controller-mdlxg            0/1     Running            30 (23s ago)     73m
ingress-nginx   nginx-ingress-controller-ms44h            0/1     Running            31 (2m36s ago)   73m
kube-system     calico-kube-controllers-85d56898c-swvqw   0/1     Running            30 (20s ago)     74m
kube-system     canal-h5lcp                               2/2     Running            6 (2m9s ago)     74m
kube-system     canal-rwz8j                               2/2     Running            6 (104s ago)     74m
kube-system     canal-sdfmv                               2/2     Running            6 (2m34s ago)    74m
kube-system     canal-trxkx                               2/2     Running            6 (3m3s ago)     74m
kube-system     coredns-autoscaler-74d474f45c-knhk7       1/1     Running            3 (2m34s ago)    74m
kube-system     coredns-dfb7f8fd4-7ncjq                   0/1     Running            3 (99s ago)      74m
kube-system     metrics-server-c47f7c9bb-g5jxw            0/1     CrashLoopBackOff   30 (52s ago)     74m
kube-system     rke-coredns-addon-deploy-job-2vqwq        0/1     Completed          0                74m
kube-system     rke-ingress-controller-deploy-job-6vk5x   0/1     Completed          0                74m
kube-system     rke-metrics-addon-deploy-job-rv5rp        0/1     Completed          0                74m
kube-system     rke-network-plugin-deploy-job-7d6zw       0/1     Completed          0                74m

great-jewelry-76121

05/12/2023, 11:33 AM
I can ping between busybox1 and busybox2 on different nodes:
Cool, that suggests that the Canal parts are working, at least. Is kube-proxy up and happy? That's what takes care of converting service IPs (like this one) into "real" IPs.
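(A sketch of how kube-proxy and its service rules could be checked on each node, assuming kube-proxy runs in its default iptables mode:)
ps -ef | grep kube-proxy                       # is the process running at all?
sudo iptables -t nat -L KUBE-SERVICES | head   # kube-proxy's service chains should exist in the nat table
sudo iptables-save | grep 10.43.0.1            # rules translating the kubernetes service cluster IP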

quiet-potato-9276

05/12/2023, 11:34 AM
There is no kube-proxy running in the list of pods I gave.

great-jewelry-76121

05/12/2023, 11:34 AM
Does kube-proxy run on the nodes directly instead?
Or is that the issue? That you don't have kube-proxy, so service IPs are all broken?
(sorry, I'm familiar with Calico, less familiar with RKE - I work for Tigera on the Calico team)

quiet-potato-9276

05/12/2023, 11:35 AM
Kube-proxy is running as a process on the nodes:
root      1852  1717  0 12:21 ?        00:00:01 kube-proxy --cluster-cidr=10.42.0.0/16 --hostname-override=192.168.0.170 --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-proxy.yaml --healthz-bind-address=127.0.0.1 --v=2
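(With --healthz-bind-address=127.0.0.1 and kube-proxy's default healthz port of 10256, a quick local health check could look like this; the port is an assumption based on the defaults:)
curl http://127.0.0.1:10256/healthz   # should return a small JSON payload if kube-proxy is healthy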

great-jewelry-76121

05/12/2023, 11:36 AM
Any interesting logs out of kube-proxy?
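(A sketch of how those logs could be pulled on a node, assuming RKE runs kube-proxy as a Docker container named "kube-proxy":)
docker logs --tail 200 kube-proxy 2>&1 | grep -iE 'error|warn'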

quiet-potato-9276

05/12/2023, 11:43 AM
Nothing of note; I'll check the other logs.
I'm getting these errors in the kube-controller-manager log:
{"log":"E0512 11:52:16.996066       1 resource_quota_controller.go:417] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request\n","stream":"stderr","time":"2023-05-12T11:52:16.996473805Z"}
{"log":"W0512 11:52:18.369195       1 garbagecollector.go:752] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]\n","stream":"stderr","time":"2023-05-12T11:52:18.370125543Z"}
So I now see that the error points to the metrics server. I can see the problem is there, as it is not providing the API response:
$ kubectl api-resources
error: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
and:
$ kubectl get apiservice
v1beta1.metrics.k8s.io                 kube-system/metrics-server   False (MissingEndpoints)   114m
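(MissingEndpoints usually means the metrics-server Service has no ready backing pods; a quick way to confirm that, sketched:)
kubectl -n kube-system get endpoints metrics-server   # should list pod IPs once metrics-server is ready
kubectl describe apiservice v1beta1.metrics.k8s.io    # shows why the aggregated API is marked unavailable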
DNS is not working in pods, and I'm getting this error in coredns:
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout
However, the API server itself is up:
[al@rkemaster01 ~]$ curl -k https://10.43.0.1:443/
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}
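(To confirm the service IP is unreachable from inside the pod network specifically, and not from the nodes, a throwaway curl pod could be used; the image name is an assumption:)
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- curl -k -m 5 https://10.43.0.1/version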
So it was DNS. I ran this on all the nodes:
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F
And killed the existing coredns pod. Now everything is working:
NAMESPACE       NAME                                      READY   STATUS      RESTARTS         AGE
default         busybox1                                  1/1     Running     2 (72m ago)      92m
default         busybox2                                  1/1     Running     1 (72m ago)      92m
default         nginx                                     1/1     Running     3 (72m ago)      142m
ingress-nginx   ingress-nginx-admission-create-p8z4t      0/1     Completed   0                144m
ingress-nginx   nginx-ingress-controller-ck5bh            1/1     Running     0                27s
ingress-nginx   nginx-ingress-controller-ct84l            1/1     Running     0                50s
ingress-nginx   nginx-ingress-controller-kqczp            1/1     Running     50 (7m20s ago)   144m
kube-system     calico-kube-controllers-85d56898c-swvqw   1/1     Running     52 (5m46s ago)   144m
kube-system     canal-h5lcp                               2/2     Running     6 (72m ago)      144m
kube-system     canal-rwz8j                               2/2     Running     6 (72m ago)      144m
kube-system     canal-sdfmv                               2/2     Running     6 (72m ago)      144m
kube-system     canal-trxkx                               2/2     Running     6 (73m ago)      144m
kube-system     coredns-autoscaler-74d474f45c-knhk7       1/1     Running     3 (72m ago)      144m
kube-system     coredns-dfb7f8fd4-9gz8j                   1/1     Running     0                3m19s
kube-system     coredns-dfb7f8fd4-dpdcp                   1/1     Running     0                9m15s
kube-system     metrics-server-c47f7c9bb-kjt8f            1/1     Running     0                2m35s
kube-system     rke-coredns-addon-deploy-job-2vqwq        0/1     Completed   0                144m
kube-system     rke-ingress-controller-deploy-job-6vk5x   0/1     Completed   0                144m
kube-system     rke-metrics-addon-deploy-job-rv5rp        0/1     Completed   0                144m
kube-system     rke-network-plugin-deploy-job-7d6zw       0/1     Completed   0                144m
I'm not sure what is causing this issue, or whether it will return after a reboot. I've never really understood iptables.
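(Before flushing, the offending rules can usually be spotted by dumping the ruleset; a sketch of what that inspection could look like on a node:)
sudo iptables-save > /tmp/iptables-before.txt           # keep a copy of the full ruleset for comparison after a reboot
sudo iptables -L INPUT -n -v | grep -iE 'reject|drop'   # host firewalls typically insert REJECT rules here
sudo iptables -L FORWARD -n -v | grep -iE 'reject|drop' # REJECT rules in FORWARD break pod-to-service traffic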

great-jewelry-76121

05/12/2023, 12:50 PM
Do you have anything that might have been writing to iptables on the nodes? (Apart from kube-proxy and calico)
Do you have any network policy configured?
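(A quick way to check for network policies across all namespaces, as a sketch:)
kubectl get networkpolicy --all-namespaces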

quiet-potato-9276

05/12/2023, 1:03 PM
No. These are fresh VMs, and all I had done was bring the cluster up with rke.

great-jewelry-76121

05/18/2023, 9:14 AM
These are fresh VMs
Sure, but some OSes enable firewalls by default that write to iptables: ufw, firewalld, etc.
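(On CentOS 7 the usual suspect is firewalld, whose default REJECT rules with icmp-host-prohibited show up to applications as "no route to host"; a sketch of how to check and, if the host firewall isn't needed, disable it:)
sudo systemctl status firewalld          # is firewalld active and managing iptables?
sudo systemctl disable --now firewalld   # only if you are sure you don't need the host firewall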