# k3s
s
$ kubectl describe pods --namespace=kube-system -l k8s-app=kube-dns
E0410 20:19:54.600425   17207 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:19:54.638617   17207 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:19:54.648790   17207 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:19:54.657229   17207 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
Name:                 coredns-5db57449d8-sm8zm
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      coredns
Node:                 jet/10.248.254.23
Start Time:           Mon, 10 Apr 2023 11:46:50 +0200
Labels:               k8s-app=kube-dns
                      pod-template-hash=5db57449d8
Annotations:          kubectl.kubernetes.io/restartedAt: 2022-10-01T06:14:51Z
Status:               Running
IP:                   10.42.0.175
IPs:
  IP:           10.42.0.175
Controlled By:  ReplicaSet/coredns-5db57449d8
Containers:
  coredns:
    Container ID:  containerd://fb8073917ecbcb6c13ee85b5310ff698e9402d8033a10fbd85fe97284be9e6fb
    Image:         rancher/mirrored-coredns-coredns:1.9.4
    Image ID:      docker.io/rancher/mirrored-coredns-coredns@sha256:823626055cba80e2ad6ff26e18df206c7f26964c7cd81a8ef57b4dc16c0eec61
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Running
      Started:      Mon, 10 Apr 2023 19:54:38 +0200
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 10 Apr 2023 11:46:59 +0200
      Finished:     Mon, 10 Apr 2023 19:54:29 +0200
    Ready:          False
    Restart Count:  1
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=2s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /etc/coredns/custom from custom-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lr69f (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  custom-config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns-custom
    Optional:  true
  kube-api-access-lr69f:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               kubernetes.io/os=linux
Tolerations:                  CriticalAddonsOnly op=Exists
                              node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                              node-role.kubernetes.io/master:NoSchedule op=Exists
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  kubernetes.io/hostname:DoNotSchedule when max skew 1 is exceeded for selector k8s-app=kube-dns
Events:
  Type     Reason          Age                   From     Message
  ----     ------          ----                  ----     -------
  Warning  Unhealthy       48m (x14319 over 8h)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy       31m (x464 over 46m)   kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
  Normal   SandboxChanged  25m                   kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          25m                   kubelet  Container image "rancher/mirrored-coredns-coredns:1.9.4" already present on machine
  Normal   Created         25m                   kubelet  Created container coredns
  Normal   Started         25m                   kubelet  Started container coredns
  Warning  Unhealthy       25m (x9 over 25m)     kubelet  Readiness probe failed: Get "http://10.42.0.175:8181/ready": dial tcp 10.42.0.175:8181: connect: connection refused
  Warning  Unhealthy       20s (x762 over 25m)   kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
I'm also getting
couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
in the output of every kubectl command I run...
$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
E0410 20:24:27.852163   17672 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:24:27.943484   17672 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:24:27.956274   17672 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[INFO] plugin/ready: Still waiting on: "kubernetes"
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
The important bit here seems to be:
plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout
$ kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.26.0
Kustomize Version: v4.5.7
Server Version: v1.26.3+k3s1
$ kubectl cluster-info
Kubernetes control plane is running at https://jet.galaxy:6443
CoreDNS is running at https://jet.galaxy:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://jet.galaxy:6443/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy
Any ideas? 🙂
c
When did you upgrade to 1.26.3?
or is this a fresh cluster?
s
Hmm, I have some kind of auto-upgrade thing going. How can I check this?
c
look at your logs?
you said nobody did anything, but did you look at the upgrade-controller logs?
or check the k3s systemd logs for nodes being restarted?
s
Thanks, I'll check those now!
kubectl logs -n system-upgrade pod/system-upgrade-controller-5f9d54d49f-zkw6j
is not giving me anything useful, because it seems that the physical machines have been restarted between the time that the problem started and now, in an attempt to fix the problem (i.e. the old "try turning it off and on again").
g2g, I'll check back in a few hours. Thanks for the advice so far!
It looks like I can't ping 10.43.0.1 from inside any pod
After debugging for a few more hours, I'm at a total loss.
r
Ping doesn't pass through all networking; it's ICMP rather than TCP or UDP. Try running `nslookup` in client mode (run it with no parameters to get an interactive shell). You should be able to use the command `server 10.43.0.1` to point your queries at that IP, then just toss in hostnames and see if it communicates. Or you could try netcat or something, too.
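As a concrete sketch of those checks, run from inside a debug pod (assuming a BusyBox/Alpine-style toolset; 10.43.0.1 is the default k3s apiserver service IP, and 10.43.0.10 the default cluster DNS IP, both assumptions about this cluster):

```shell
# Raw TCP reachability of the apiserver service IP; ping (ICMP) can be
# blocked even when TCP works, so test the actual port.
nc -vz 10.43.0.1 443

# Ask the cluster DNS service directly for a well-known in-cluster name
# (10.43.0.10 is the default kube-dns ClusterIP in k3s).
nslookup kubernetes.default.svc.cluster.local 10.43.0.10

# An HTTPS probe of the apiserver; even a 401 Unauthorized reply proves
# that TCP and TLS connectivity are fine.
wget --no-check-certificate -qO- https://10.43.0.1/version
```

A timeout on the first command points at a network or firewall problem rather than at CoreDNS itself.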
s
Thanks for the tip. From a random pod I use for debugging: I don't get an interactive shell running bare `nslookup`, maybe because I'm in an Alpine container with a simple BusyBox version of nslookup. `netcat 10.43.0.1` does not throw an error; it seems to wait for input. `curl https://10.43.0.1/version` gives me a certificate error (I'm assuming this is not a problem). `curl --insecure https://10.43.0.1/version` gives me a JSON response with 401 Unauthorized. So I still don't quite get why the kube-dns pod is giving me
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout
Unfortunately, I can't exec a shell into that pod, since it seems to have no shell. At least, it does not have /bin/sh or /bin/bash.
r
I don't remember the protocol for DNS, just that it's TCP and UDP on port 53. You could always launch a job that just runs `sleep 3600` in an Ubuntu container or something and attach to that.
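A minimal version of that debug-pod trick might look like this (the pod name `netdebug` is made up):

```shell
# Launch a throwaway pod that just sleeps, so there is something to exec into.
kubectl run netdebug --image=ubuntu --restart=Never -- sleep 3600

# Attach a shell once the pod is Running.
kubectl exec -it netdebug -- bash

# Remove it when done.
kubectl delete pod netdebug
```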
s
I just tried
kubectl --namespace=kube-system rollout restart deploy/coredns
and most things started coming online again (strangely enough, not my kubernetes dashboard; still debugging that). But now I still don't know:
• why this happened in the first place
• why a full reboot of all nodes in the cluster did nothing to solve the problem
• how to prevent this in the future
(thanks once again for the help so far)
So coredns is working now, but my kubernetes dashboard pod is still failing with the same error that I saw in other pods before:
panic: Get "https://10.43.0.1:443/api/v1/namespaces/kubernetes-dashboard/secrets/kubernetes-dashboard-csrf": dial tcp 10.43.0.1:443: i/o timeout
This problem persists after doing a `rollout restart` on `deploy/kubernetes-dashboard`.
r
Oh, sorry, I blanked on well-known IPs there and was thinking 10.43.0.1 was CoreDNS rather than the main apiserver. One thing you might do is check `kubectl get pods -A -o wide` as you're restarting things and see if you have a problem worker node. You could also try scaling your deployment down to 0 completely and waiting for all pods to die before scaling back up; sometimes you can get something weird persisting. Past that, you're debugging the network layer for an apparently rare problem, so if you have a service mesh and something like Kiali installed, that might help.
s
All my workers seem to be doing fine. It's just a few system things, like the `kubernetes-dashboard`, `cert-manager-cainjector`, and `system-upgrade-controller`, that are in `CrashLoopBackOff`. They all run on my master node. I scaled all three down to 0:
$ kubectl --namespace=kubernetes-dashboard scale deploy/kubernetes-dashboard --replicas=0
deployment.apps/kubernetes-dashboard scaled

$ kubectl --namespace=cert-manager scale deploy/cert-manager-cainjector --replicas=0
deployment.apps/cert-manager-cainjector scaled

$ kubectl --namespace=system-upgrade scale deploy/system-upgrade-controller --replicas=0
deployment.apps/system-upgrade-controller scaled
and back up with `--replicas=1`. I'm still getting `dial tcp 10.43.0.1:443: i/o timeout` in `kubernetes-dashboard` and in `system-upgrade-controller`. `cert-manager-cainjector` is working now, but maybe it's just not making any requests to 10.43.0.1 yet ...

> so if you have a service mesh and something like kiali installed that might help.

This last sentence is way above my level of expertise. I will have to look up what "service mesh" means and what "kiali" is.
r
Ok, then probably not. A service mesh is an abstracted network layer that lets you do various things, including logging all network packets. Kiali (that doesn't look spelled right, but I haven't used it myself) is a visualization app for that sort of data.
Nah, you're seeing the request. Not sure what to say on finding the problem you're seeing if it's that rare and goes away easily. Not really an answer, but you could try k9s or something else client-side to take the place of kubernetes-dashboard.
s
I'm going to get some sleep (UTC+2 here) and continue this quest tomorrow 👋
c
sounds like you might be running into https://github.com/k3s-io/k3s/issues/7203 - can you confirm that you don’t have any iptables rules configured on your host that might be blocking access to the Kubernetes networks now that kube-router no longer manages rules to allow access?
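One way to check for such host rules (a sketch; 10.42.0.0/16 and 10.43.0.0/16 are the default k3s pod and service CIDRs):

```shell
# Dump all iptables rules and look for DROP/REJECT targets that could
# match the pod or service networks.
sudo iptables -S | grep -E 'DROP|REJECT'

# ufw manages iptables underneath; show its own view as well.
sudo ufw status verbose
```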
s
I don't recall ever using iptables directly, but I do have ufw set up. The host is Ubuntu 20.04.3 LTS.
ufw status verbose
gives me
Status: active
Logging: on (low)
Default: deny (incoming), allow (outgoing), allow (routed)
New profiles: skip

To                         Action      From
--                         ------      ----
16443/tcp                  ALLOW IN    10.248.254.0/24
10250/tcp                  ALLOW IN    10.248.254.0/24
10255/tcp                  ALLOW IN    10.248.254.0/24
25000/tcp                  ALLOW IN    10.248.254.0/24
12379/tcp                  ALLOW IN    10.248.254.0/24
10257/tcp                  ALLOW IN    10.248.254.0/24
10259/tcp                  ALLOW IN    10.248.254.0/24
19001/tcp                  ALLOW IN    10.248.254.0/24
22/tcp                     ALLOW IN    Anywhere
5568/tcp                   ALLOW IN    Anywhere
443/tcp                    ALLOW IN    Anywhere
Anywhere on vxlan.calico   ALLOW IN    Anywhere
Anywhere on cali+          ALLOW IN    Anywhere
32000/tcp                  ALLOW IN    10.248.254.0/24
6443/tcp                   ALLOW IN    10.248.254.0/24
8472/udp                   ALLOW IN    10.248.254.0/24
2379/tcp                   ALLOW IN    10.248.254.0/24
2380/tcp                   ALLOW IN    10.248.254.0/24
80/tcp                     ALLOW IN    Anywhere
7946/tcp                   ALLOW IN    10.248.254.0/24
7946/udp                   ALLOW IN    10.248.254.0/24
30000/tcp                  ALLOW IN    Anywhere
30001/tcp                  ALLOW IN    Anywhere
30201/tcp                  ALLOW IN    Anywhere
5201                       ALLOW IN    10.248.254.0/24
51821:51830/udp            ALLOW IN    Anywhere
51820/udp                  ALLOW IN    Anywhere
5201/tcp                   ALLOW IN    10.248.253.0/24
22/tcp (v6)                ALLOW IN    Anywhere (v6)
5568/tcp (v6)              ALLOW IN    Anywhere (v6)
443/tcp (v6)               ALLOW IN    Anywhere (v6)
Anywhere (v6) on vxlan.calico ALLOW IN    Anywhere (v6)
Anywhere (v6) on cali+     ALLOW IN    Anywhere (v6)
80/tcp (v6)                ALLOW IN    Anywhere (v6)
30000/tcp (v6)             ALLOW IN    Anywhere (v6)
30001/tcp (v6)             ALLOW IN    Anywhere (v6)
30201/tcp (v6)             ALLOW IN    Anywhere (v6)
51821:51830/udp (v6)       ALLOW IN    Anywhere (v6)
51820/udp (v6)             ALLOW IN    Anywhere (v6)

Anywhere                   ALLOW OUT   Anywhere on vxlan.calico
Anywhere                   ALLOW OUT   Anywhere on cali+
Anywhere (v6)              ALLOW OUT   Anywhere (v6) on vxlan.calico
Anywhere (v6)              ALLOW OUT   Anywhere (v6) on cali+
I don't know enough about kubernetes networking to see how this affects anything.
Do I just need to open something extra in the firewall?
`ufw default allow incoming` is not an option for me, I think, because as far as I understand that would open up all ports on all network interfaces, including my VM's internet connection.
What's confusing me here is that I don't see any interfaces named `vxlan.calico` or `cali+` when I run `ip link show` or `ip addr show`. I do see interfaces named `flannel.1` and `cni0`. Is this normal?
c
Yes
s
Another thing that I don't understand:
8: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether 0a:e7:e6:d6:6a:5f brd ff:ff:ff:ff:ff:ff
    inet 10.42.0.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::8e7:e6ff:fed6:6a5f/64 scope link
       valid_lft forever preferred_lft forever
9: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 56:ed:81:e8:e7:5b brd ff:ff:ff:ff:ff:ff
    inet 10.42.0.1/24 brd 10.42.0.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::54ed:81ff:fee8:e75b/64 scope link
       valid_lft forever preferred_lft forever
Both of these have a `10.42.*` IP range. But isn't the kubernetes API supposed to be at `10.43.0.1` (`43`, not `42`)?
c
Kube-router used to add rules that bypassed the ufw configuration and allowed all cluster traffic, but due to some changes that's been removed from the most recent releases. You need to ensure you actually allow cluster traffic now if using a host firewall.
We will probably restore the old behavior for the next round of releases
s
Thanks, I think I understand now! Does the recommendation at https://docs.k3s.io/advanced#ubuntu still apply even after the next release?
As far as I can see:
ufw allow 6443/tcp # 1. This would expose my API server to the internet. Not going to do this.
ufw allow from 10.42.0.0/16 to any # 2. Allow all pods to communicate with my host.
ufw allow from 10.43.0.0/16 to any # 3. Allow all services to communicate with my host.
After applying 2 and 3, it's working again!! 🦜
THANK YOU!!
c
Yes, that always should have been done, and still should be done going forward. The bug that was 'fixed' allowed k3s to work even without proper allow rules in place.