square-engine-61315
04/10/2023, 6:23 PM
$ kubectl describe pods --namespace=kube-system -l k8s-app=kube-dns
E0410 20:19:54.600425 17207 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:19:54.638617 17207 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:19:54.648790 17207 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:19:54.657229 17207 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
Name: coredns-5db57449d8-sm8zm
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Service Account: coredns
Node: jet/10.248.254.23
Start Time: Mon, 10 Apr 2023 11:46:50 +0200
Labels: k8s-app=kube-dns
pod-template-hash=5db57449d8
Annotations: kubectl.kubernetes.io/restartedAt: 2022-10-01T06:14:51Z
Status: Running
IP: 10.42.0.175
IPs:
IP: 10.42.0.175
Controlled By: ReplicaSet/coredns-5db57449d8
Containers:
coredns:
Container ID: containerd://fb8073917ecbcb6c13ee85b5310ff698e9402d8033a10fbd85fe97284be9e6fb
Image: rancher/mirrored-coredns-coredns:1.9.4
Image ID: docker.io/rancher/mirrored-coredns-coredns@sha256:823626055cba80e2ad6ff26e18df206c7f26964c7cd81a8ef57b4dc16c0eec61
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Mon, 10 Apr 2023 19:54:38 +0200
Last State: Terminated
Reason: Unknown
Exit Code: 255
Started: Mon, 10 Apr 2023 11:46:59 +0200
Finished: Mon, 10 Apr 2023 19:54:29 +0200
Ready: False
Restart Count: 1
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=2s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/etc/coredns/custom from custom-config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lr69f (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
custom-config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns-custom
Optional: true
kube-api-access-lr69f:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node-role.kubernetes.io/control-plane:NoSchedule op=Exists
node-role.kubernetes.io/master:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints: kubernetes.io/hostname:DoNotSchedule when max skew 1 is exceeded for selector k8s-app=kube-dns
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 48m (x14319 over 8h) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy 31m (x464 over 46m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
Normal SandboxChanged 25m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 25m kubelet Container image "rancher/mirrored-coredns-coredns:1.9.4" already present on machine
Normal Created 25m kubelet Created container coredns
Normal Started 25m kubelet Started container coredns
Warning Unhealthy 25m (x9 over 25m) kubelet Readiness probe failed: Get "http://10.42.0.175:8181/ready": dial tcp 10.42.0.175:8181: connect: connection refused
Warning Unhealthy 20s (x762 over 25m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
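(For reference, what the failing readiness probe sees can be reproduced by hand against the ready endpoint from the Readiness line above; a sketch, using the pod name and port shown there:)
$ kubectl --namespace=kube-system port-forward pod/coredns-5db57449d8-sm8zm 8181:8181
$ curl -i http://localhost:8181/ready   # in a second terminal; returns 503 until CoreDNS's kubernetes plugin reports ready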
couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
in the output of every kubectl command I run...
$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
E0410 20:24:27.852163 17672 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:24:27.943484 17672 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:24:27.956274 17672 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[INFO] plugin/ready: Still waiting on: "kubernetes"
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
The important bit here seems to be:
plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout
$ kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.26.0
Kustomize Version: v4.5.7
Server Version: v1.26.3+k3s1
$ kubectl cluster-info
Kubernetes control plane is running at https://jet.galaxy:6443
CoreDNS is running at https://jet.galaxy:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://jet.galaxy:6443/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy
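(As an aside, the repeated "couldn't get resource list for metrics.k8s.io/v1beta1" lines come from kubectl's API discovery hitting the aggregated metrics API; a sketch of how to inspect it, assuming the stock metrics-server name and labels:)
$ kubectl get apiservice v1beta1.metrics.k8s.io   # AVAILABLE shows False with a reason while metrics-server is unreachable
$ kubectl --namespace=kube-system get pods -l k8s-app=metrics-server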
creamy-pencil-82913
04/10/2023, 6:27 PM
square-engine-61315
04/10/2023, 6:28 PM
creamy-pencil-82913
04/10/2023, 6:29 PM
square-engine-61315
04/10/2023, 6:32 PM
kubectl logs -n system-upgrade pod/system-upgrade-controller-5f9d54d49f-zkw6j
is not giving me anything useful, because it seems that the physical machines have been restarted between the time that the problem started and now, in an attempt to fix the problem (i.e. the old "try turning it off and on again").
rough-farmer-49135
04/10/2023, 8:26 PM
nslookup in client mode (run with no parameters to get an interactive shell). You should be able to use the command server 10.43.0.1 to set your commands to go to that IP and then just toss in hostnames & see if it communicates. That or you could try netcat or something too.
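(A sketch of that interactive flow, pointing at the service IP from the error above; the hostname queried is just an example:)
$ nslookup
> server 10.43.0.1
> kubernetes.default.svc.cluster.local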
square-engine-61315
04/10/2023, 8:33 PM
That doesn't seem to work in nslookup. Maybe because I'm in an alpine container with a simple busybox version of nslookup.
netcat 10.43.0.1 does not throw an error; it seems to wait for input.
curl https://10.43.0.1/version gives me a certificate error (I'm assuming this is not a problem)
curl --insecure https://10.43.0.1/version gives me a JSON response with 401 Unauthorized
So I still don't quite get why the kube-dns pod is giving me
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout
Unfortunately, I can't exec a shell into that pod, since it seems to have no shell. At least, it does not have /bin/sh or /bin/bash.
rough-farmer-49135
04/10/2023, 8:44 PM
sleep 3600 in a Ubuntu container or something and attach to that.
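(A sketch of that approach; the pod name dnsdebug and the ubuntu image are illustrative:)
$ kubectl run dnsdebug --image=ubuntu --restart=Never -- sleep 3600
$ kubectl exec -it dnsdebug -- bash
$ kubectl delete pod dnsdebug   # clean up afterwards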
square-engine-61315
04/10/2023, 8:45 PM
I ran kubectl --namespace=kube-system rollout restart deploy/coredns and most things started coming online again (strangely enough, not my kubernetes dashboard - still debugging that).
But now I still don't know
• why this happened in the first place
• why a full reboot of all nodes in the cluster did nothing to solve the problem
• how to prevent this in the future
(thanks once again for the help so far)
panic: Get "https://10.43.0.1:443/api/v1/namespaces/kubernetes-dashboard/secrets/kubernetes-dashboard-csrf": dial tcp 10.43.0.1:443: i/o timeout
this problem persists after doing a rollout restart on deploy/kubernetes-dashboard
rough-farmer-49135
04/10/2023, 8:55 PM
kubectl get pods -A -o wide as you're restarting things and see if you have a problem worker node? You could also try scaling your deployment down to 0 completely and wait for all pods to die before scaling back up, sometimes you can get something weird persisting. Past that you're debugging network layer for an apparently rare problem, so if you have a service mesh and something like kiali installed that might help.
square-engine-61315
04/10/2023, 9:07 PM
It's kubernetes-dashboard, cert-manager-cainjector, and system-upgrade-controller that are in CrashLoopBackOff. They all run on my master node.
I scaled all three down to 0:
$ kubectl --namespace=kubernetes-dashboard scale deploy/kubernetes-dashboard --replicas=0
deployment.apps/kubernetes-dashboard scaled
$ kubectl --namespace=cert-manager scale deploy/cert-manager-cainjector --replicas=0
deployment.apps/cert-manager-cainjector scaled
$ kubectl --namespace=system-upgrade scale deploy/system-upgrade-controller --replicas=0
deployment.apps/system-upgrade-controller scaled
and back up with --replicas=1
I'm still getting dial tcp 10.43.0.1:443: i/o timeout in kubernetes-dashboard and in system-upgrade-controller. cert-manager-cainjector is working now. But maybe it's just not making any requests to 10.43.0.1 yet ...
> so if you have a service mesh and something like kiali installed that might help.
This last sentence is way above my level of expertise. I will have to look up what "service mesh" means and what "kiali" is.
rough-farmer-49135
04/10/2023, 9:10 PM
square-engine-61315
04/10/2023, 9:16 PM
creamy-pencil-82913
04/10/2023, 9:18 PM
square-engine-61315
04/11/2023, 6:59 AM
ufw status verbose gives me
Status: active
Logging: on (low)
Default: deny (incoming), allow (outgoing), allow (routed)
New profiles: skip
To Action From
-- ------ ----
16443/tcp ALLOW IN 10.248.254.0/24
10250/tcp ALLOW IN 10.248.254.0/24
10255/tcp ALLOW IN 10.248.254.0/24
25000/tcp ALLOW IN 10.248.254.0/24
12379/tcp ALLOW IN 10.248.254.0/24
10257/tcp ALLOW IN 10.248.254.0/24
10259/tcp ALLOW IN 10.248.254.0/24
19001/tcp ALLOW IN 10.248.254.0/24
22/tcp ALLOW IN Anywhere
5568/tcp ALLOW IN Anywhere
443/tcp ALLOW IN Anywhere
Anywhere on vxlan.calico ALLOW IN Anywhere
Anywhere on cali+ ALLOW IN Anywhere
32000/tcp ALLOW IN 10.248.254.0/24
6443/tcp ALLOW IN 10.248.254.0/24
8472/udp ALLOW IN 10.248.254.0/24
2379/tcp ALLOW IN 10.248.254.0/24
2380/tcp ALLOW IN 10.248.254.0/24
80/tcp ALLOW IN Anywhere
7946/tcp ALLOW IN 10.248.254.0/24
7946/udp ALLOW IN 10.248.254.0/24
30000/tcp ALLOW IN Anywhere
30001/tcp ALLOW IN Anywhere
30201/tcp ALLOW IN Anywhere
5201 ALLOW IN 10.248.254.0/24
51821:51830/udp ALLOW IN Anywhere
51820/udp ALLOW IN Anywhere
5201/tcp ALLOW IN 10.248.253.0/24
22/tcp (v6) ALLOW IN Anywhere (v6)
5568/tcp (v6) ALLOW IN Anywhere (v6)
443/tcp (v6) ALLOW IN Anywhere (v6)
Anywhere (v6) on vxlan.calico ALLOW IN Anywhere (v6)
Anywhere (v6) on cali+ ALLOW IN Anywhere (v6)
80/tcp (v6) ALLOW IN Anywhere (v6)
30000/tcp (v6) ALLOW IN Anywhere (v6)
30001/tcp (v6) ALLOW IN Anywhere (v6)
30201/tcp (v6) ALLOW IN Anywhere (v6)
51821:51830/udp (v6) ALLOW IN Anywhere (v6)
51820/udp (v6) ALLOW IN Anywhere (v6)
Anywhere ALLOW OUT Anywhere on vxlan.calico
Anywhere ALLOW OUT Anywhere on cali+
Anywhere (v6) ALLOW OUT Anywhere (v6) on vxlan.calico
Anywhere (v6) ALLOW OUT Anywhere (v6) on cali+
I don't know enough about kubernetes networking to see how this affects anything.
ufw default accept incoming is not an option for me, I think, because as far as I understand that would open up all ports on all network interfaces, including my VM's internet connection.
I don't see vxlan.calico or cali+ when I run ip link show or ip addr show. I do see interfaces named flannel.1 and cni0. Is this normal?
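(One way to check whether ufw is actually dropping the pod/service traffic is to look for its block entries in the kernel log, since logging is enabled above; a sketch:)
$ sudo dmesg | grep 'UFW BLOCK'
$ sudo journalctl -k | grep 'UFW BLOCK'   # same thing on a systemd host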
creamy-pencil-82913
04/11/2023, 7:33 AM
square-engine-61315
04/11/2023, 7:33 AM
8: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 0a:e7:e6:d6:6a:5f brd ff:ff:ff:ff:ff:ff
inet 10.42.0.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::8e7:e6ff:fed6:6a5f/64 scope link
valid_lft forever preferred_lft forever
9: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether 56:ed:81:e8:e7:5b brd ff:ff:ff:ff:ff:ff
inet 10.42.0.1/24 brd 10.42.0.255 scope global cni0
valid_lft forever preferred_lft forever
inet6 fe80::54ed:81ff:fee8:e75b/64 scope link
valid_lft forever preferred_lft forever
Both of these have a 10.42.* IP range. But isn't the kubernetes API supposed to be at 10.43.0.1 (43, not 42)?
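(For what it's worth: in a default k3s install the pod network is 10.42.0.0/16 and the service network is 10.43.0.0/16; 10.43.0.1 is the ClusterIP of the kubernetes Service, a virtual IP implemented by kube-proxy rules rather than an address bound to any interface, so it won't show up in ip addr. A sketch of how to confirm both ranges:)
$ kubectl get svc kubernetes                                  # ClusterIP, 10.43.0.1 by default
$ kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'    # per-node pod ranges carved out of 10.42.0.0/16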
creamy-pencil-82913
04/11/2023, 7:34 AM
square-engine-61315
04/11/2023, 7:38 AM
ufw allow 6443/tcp # 1. This would expose my API server to the internet. Not going to do this.
ufw allow from 10.42.0.0/16 to any # 2. Allow all pods to communicate with my host.
ufw allow from 10.43.0.0/16 to any # 3. Allow all services to communicate with my host.
After applying 2 and 3, it's working again!! :partyparrot:
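(A quick way to confirm the fix from the kubectl side, as a sketch: CoreDNS should go Ready and the metrics API discovery errors should stop.)
$ kubectl --namespace=kube-system get pods -l k8s-app=kube-dns   # expect READY 1/1
$ kubectl get apiservice v1beta1.metrics.k8s.io                  # expect AVAILABLE True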
creamy-pencil-82913
04/11/2023, 9:24 AM