# k3s
s
$ kubectl describe pods --namespace=kube-system -l k8s-app=kube-dns
E0410 20:19:54.600425   17207 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:19:54.638617   17207 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:19:54.648790   17207 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:19:54.657229   17207 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
Name:                 coredns-5db57449d8-sm8zm
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      coredns
Node:                 jet/10.248.254.23
Start Time:           Mon, 10 Apr 2023 11:46:50 +0200
Labels:               k8s-app=kube-dns
                      pod-template-hash=5db57449d8
Annotations:          kubectl.kubernetes.io/restartedAt: 2022-10-01T06:14:51Z
Status:               Running
IP:                   10.42.0.175
IPs:
  IP:           10.42.0.175
Controlled By:  ReplicaSet/coredns-5db57449d8
Containers:
  coredns:
    Container ID:  containerd://fb8073917ecbcb6c13ee85b5310ff698e9402d8033a10fbd85fe97284be9e6fb
    Image:         rancher/mirrored-coredns-coredns:1.9.4
    Image ID:      docker.io/rancher/mirrored-coredns-coredns@sha256:823626055cba80e2ad6ff26e18df206c7f26964c7cd81a8ef57b4dc16c0eec61
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Running
      Started:      Mon, 10 Apr 2023 19:54:38 +0200
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 10 Apr 2023 11:46:59 +0200
      Finished:     Mon, 10 Apr 2023 19:54:29 +0200
    Ready:          False
    Restart Count:  1
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=2s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /etc/coredns/custom from custom-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lr69f (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  custom-config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns-custom
    Optional:  true
  kube-api-access-lr69f:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               kubernetes.io/os=linux
Tolerations:                  CriticalAddonsOnly op=Exists
                              node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                              node-role.kubernetes.io/master:NoSchedule op=Exists
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  kubernetes.io/hostname:DoNotSchedule when max skew 1 is exceeded for selector k8s-app=kube-dns
Events:
  Type     Reason          Age                   From     Message
  ----     ------          ----                  ----     -------
  Warning  Unhealthy       48m (x14319 over 8h)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy       31m (x464 over 46m)   kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
  Normal   SandboxChanged  25m                   kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          25m                   kubelet  Container image "rancher/mirrored-coredns-coredns:1.9.4" already present on machine
  Normal   Created         25m                   kubelet  Created container coredns
  Normal   Started         25m                   kubelet  Started container coredns
  Warning  Unhealthy       25m (x9 over 25m)     kubelet  Readiness probe failed: Get "http://10.42.0.175:8181/ready": dial tcp 10.42.0.175:8181: connect: connection refused
  Warning  Unhealthy       20s (x762 over 25m)   kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
I'm also getting
couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
in the output of every kubectl command I run...
$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
E0410 20:24:27.852163   17672 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:24:27.943484   17672 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0410 20:24:27.956274   17672 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[INFO] plugin/ready: Still waiting on: "kubernetes"
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
The important bit here seems to be:
plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout
$ kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.26.0
Kustomize Version: v4.5.7
Server Version: v1.26.3+k3s1
$ kubectl cluster-info
Kubernetes control plane is running at https://jet.galaxy:6443
CoreDNS is running at https://jet.galaxy:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://jet.galaxy:6443/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy
Any ideas? 🙂
c
When did you upgrade to 1.26.3?
or is this a fresh cluster?
s
Hmm, I have some kind of auto-upgrade thing going. How can I check this?
c
look at your logs?
you said nobody did anything, but did you look at the upgrade-controller logs?
or check the k3s systemd logs for nodes being restarted?
s
Thanks, I'll check those now!
kubectl logs -n system-upgrade pod/system-upgrade-controller-5f9d54d49f-zkw6j
is not giving me anything useful, because it seems that the physical machines have been restarted between the time that the problem started and now, in an attempt to fix the problem (i.e. the old "try turning it off and on again").
g2g, I'll check back in a few hours. Thanks for the advice so far!
It looks like I can't ping 10.43.0.1 from inside any pod
After debugging for a few more hours, I'm at a total loss.
r
Ping doesn't pass through all networking; it's ICMP rather than TCP or UDP. Try running `nslookup` in client mode (run it with no parameters to get an interactive shell). You should be able to use the command `server 10.43.0.1` to point your queries at that IP, then just toss in hostnames and see if it communicates. Or you could try netcat or something, too.
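As a concrete sketch of those checks, run from inside a debug pod (assuming a BusyBox/Alpine-style toolset; 10.43.0.1 is the default k3s apiserver service IP, and 10.43.0.10 the default cluster DNS IP, both assumptions about this cluster):

```shell
# Raw TCP reachability of the apiserver service IP; ping (ICMP) can be
# blocked even when TCP works, so test the actual port.
nc -vz 10.43.0.1 443

# Ask the cluster DNS service directly for a well-known in-cluster name
# (10.43.0.10 is the default kube-dns ClusterIP in k3s).
nslookup kubernetes.default.svc.cluster.local 10.43.0.10

# An HTTPS probe of the apiserver; even a 401 Unauthorized reply proves
# that TCP and TLS connectivity are fine.
wget --no-check-certificate -qO- https://10.43.0.1/version
```

A timeout on the first command points at a network or firewall problem rather than at CoreDNS itself.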
s
Thanks for the tip. From a random pod I use for debugging: I don't get an interactive shell running bare `nslookup`, maybe because I'm in an Alpine container with a simple BusyBox version of nslookup. `netcat 10.43.0.1` does not throw an error; it seems to wait for input. `curl https://10.43.0.1/version` gives me a certificate error (I'm assuming this is not a problem). `curl --insecure https://10.43.0.1/version` gives me a JSON response with 401 Unauthorized. So I still don't quite get why the kube-dns pod is giving me
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout
Unfortunately, I can't exec a shell into that pod, since it seems to have no shell. At least, it does not have /bin/sh or /bin/bash.
r
I don't remember the protocol for DNS, just that it's TCP and UDP on port 53. You could always launch a job that just runs `sleep 3600` in an Ubuntu container or something and attach to that.
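A minimal version of that debug-pod trick might look like this (the pod name `netdebug` is made up):

```shell
# Launch a throwaway pod that just sleeps, so there is something to exec into.
kubectl run netdebug --image=ubuntu --restart=Never -- sleep 3600

# Attach a shell once the pod is Running.
kubectl exec -it netdebug -- bash

# Remove it when done.
kubectl delete pod netdebug
```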
s
I just tried
kubectl --namespace=kube-system rollout restart deploy/coredns
and most things started coming online again (strangely enough, not my kubernetes dashboard; still debugging that). But now I still don't know:
• why this happened in the first place
• why a full reboot of all nodes in the cluster did nothing to solve the problem
• how to prevent this in the future
(thanks once again for the help so far)
So coredns is working now, but my kubernetes dashboard pod is still failing with the same error that I saw in other pods before:
panic: Get "https://10.43.0.1:443/api/v1/namespaces/kubernetes-dashboard/secrets/kubernetes-dashboard-csrf": dial tcp 10.43.0.1:443: i/o timeout
This problem persists after doing a `rollout restart` on `deploy/kubernetes-dashboard`.
r
Oh, sorry, I blanked on well-known IPs there and was thinking 10.43.0.1 was CoreDNS rather than the main apiserver. One thing you might do is check `kubectl get pods -A -o wide` as you're restarting things and see if you have a problem worker node. You could also try scaling your deployment down to 0 completely and waiting for all pods to die before scaling back up; sometimes you can get something weird persisting. Past that, you're debugging the network layer for an apparently rare problem, so if you have a service mesh and something like Kiali installed, that might help.
s
All my workers seem to be doing fine. It's just a few system things, like the `kubernetes-dashboard`, `cert-manager-cainjector`, and `system-upgrade-controller`, that are in `CrashLoopBackOff`. They all run on my master node. I scaled all three down to 0:
$ kubectl --namespace=kubernetes-dashboard scale deploy/kubernetes-dashboard --replicas=0
deployment.apps/kubernetes-dashboard scaled

$ kubectl --namespace=cert-manager scale deploy/cert-manager-cainjector --replicas=0
deployment.apps/cert-manager-cainjector scaled

$ kubectl --namespace=system-upgrade scale deploy/system-upgrade-controller --replicas=0
deployment.apps/system-upgrade-controller scaled
and back up with `--replicas=1`. I'm still getting `dial tcp 10.43.0.1:443: i/o timeout` in `kubernetes-dashboard` and in `system-upgrade-controller`. `cert-manager-cainjector` is working now, but maybe it's just not making any requests to 10.43.0.1 yet ...

> so if you have a service mesh and something like kiali installed that might help.

This last sentence is way above my level of expertise. I will have to look up what "service mesh" means and what "kiali" is.
r
Ok, then probably not. A service mesh is an abstracted network layer that lets you do various things, including logging all network packets. Kiali (that doesn't look spelled right, but I haven't used it myself) is a visualization app for that sort of data.
Nah, you're seeing the request. Not sure what to say on finding the problem you're seeing if it's that rare and goes away easily. Not really an answer, but you could try k9s or something else client-side to take the place of kubernetes-dashboard.
s
I'm going to get some sleep (UTC+2 here) and continue this quest tomorrow 👋
c
sounds like you might be running into https://github.com/k3s-io/k3s/issues/7203 - can you confirm that you don’t have any iptables rules configured on your host that might be blocking access to the Kubernetes networks now that kube-router no longer manages rules to allow access?
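One way to check for such host rules (a sketch; 10.42.0.0/16 and 10.43.0.0/16 are the default k3s pod and service CIDRs):

```shell
# Dump all iptables rules and look for DROP/REJECT targets that could
# match the pod or service networks.
sudo iptables -S | grep -E 'DROP|REJECT'

# ufw manages iptables underneath; show its own view as well.
sudo ufw status verbose
```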
s
I don't recall ever using iptables directly, but I do have ufw set up. The host is Ubuntu 20.04.3 LTS.
ufw status verbose
gives me
Status: active
Logging: on (low)
Default: deny (incoming), allow (outgoing), allow (routed)
New profiles: skip

To                         Action      From
--                         ------      ----
16443/tcp                  ALLOW IN    10.248.254.0/24
10250/tcp                  ALLOW IN    10.248.254.0/24
10255/tcp                  ALLOW IN    10.248.254.0/24
25000/tcp                  ALLOW IN    10.248.254.0/24
12379/tcp                  ALLOW IN    10.248.254.0/24
10257/tcp                  ALLOW IN    10.248.254.0/24
10259/tcp                  ALLOW IN    10.248.254.0/24
19001/tcp                  ALLOW IN    10.248.254.0/24
22/tcp                     ALLOW IN    Anywhere
5568/tcp                   ALLOW IN    Anywhere
443/tcp                    ALLOW IN    Anywhere
Anywhere on vxlan.calico   ALLOW IN    Anywhere
Anywhere on cali+          ALLOW IN    Anywhere
32000/tcp                  ALLOW IN    10.248.254.0/24
6443/tcp                   ALLOW IN    10.248.254.0/24
8472/udp                   ALLOW IN    10.248.254.0/24
2379/tcp                   ALLOW IN    10.248.254.0/24
2380/tcp                   ALLOW IN    10.248.254.0/24
80/tcp                     ALLOW IN    Anywhere
7946/tcp                   ALLOW IN    10.248.254.0/24
7946/udp                   ALLOW IN    10.248.254.0/24
30000/tcp                  ALLOW IN    Anywhere
30001/tcp                  ALLOW IN    Anywhere
30201/tcp                  ALLOW IN    Anywhere
5201                       ALLOW IN    10.248.254.0/24
51821:51830/udp            ALLOW IN    Anywhere
51820/udp                  ALLOW IN    Anywhere
5201/tcp                   ALLOW IN    10.248.253.0/24
22/tcp (v6)                ALLOW IN    Anywhere (v6)
5568/tcp (v6)              ALLOW IN    Anywhere (v6)
443/tcp (v6)               ALLOW IN    Anywhere (v6)
Anywhere (v6) on vxlan.calico ALLOW IN    Anywhere (v6)
Anywhere (v6) on cali+     ALLOW IN    Anywhere (v6)
80/tcp (v6)                ALLOW IN    Anywhere (v6)
30000/tcp (v6)             ALLOW IN    Anywhere (v6)
30001/tcp (v6)             ALLOW IN    Anywhere (v6)
30201/tcp (v6)             ALLOW IN    Anywhere (v6)
51821:51830/udp (v6)       ALLOW IN    Anywhere (v6)
51820/udp (v6)             ALLOW IN    Anywhere (v6)

Anywhere                   ALLOW OUT   Anywhere on vxlan.calico
Anywhere                   ALLOW OUT   Anywhere on cali+
Anywhere (v6)              ALLOW OUT   Anywhere (v6) on vxlan.calico
Anywhere (v6)              ALLOW OUT   Anywhere (v6) on cali+
I don't know enough about kubernetes networking to see how this affects anything.
Do I just need to open something extra in the firewall?
`ufw default allow incoming` is not an option for me, I think, because as far as I understand that would open up all ports on all network interfaces, including my VM's internet connection.
What's confusing me here is that I don't see any interfaces named `vxlan.calico` or `cali+` when I run `ip link show` or `ip addr show`. I do see interfaces named `flannel.1` and `cni0`. Is this normal?
c
Yes
s
Another thing that I don't understand:
8: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether 0a:e7:e6:d6:6a:5f brd ff:ff:ff:ff:ff:ff
    inet 10.42.0.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::8e7:e6ff:fed6:6a5f/64 scope link
       valid_lft forever preferred_lft forever
9: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 56:ed:81:e8:e7:5b brd ff:ff:ff:ff:ff:ff
    inet 10.42.0.1/24 brd 10.42.0.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::54ed:81ff:fee8:e75b/64 scope link
       valid_lft forever preferred_lft forever
Both of these have a `10.42.*` IP range. But isn't the kubernetes API supposed to be at `10.43.0.1` (`43`, not `42`)?
c
Kube-router used to add rules that bypassed the ufw configuration and allowed all cluster traffic, but due to some changes that's been removed from the most recent releases. You need to ensure you actually allow cluster traffic now if using a host firewall.
We will probably restore the old behavior for the next round of releases
s
Thanks, I think I understand now! Does the recommendation at https://docs.k3s.io/advanced#ubuntu still apply even after the next release?
As far as I can see:
ufw allow 6443/tcp # 1. This would expose my API server to the internet. Not going to do this.
ufw allow from 10.42.0.0/16 to any # 2. Allow all pods to communicate with my host.
ufw allow from 10.43.0.0/16 to any # 3. Allow all services to communicate with my host.
After applying 2 and 3, it's working again!! 🦜
THANK YOU!!
c
Yes, that always should have been done, and still should be done going forward. The bug that was 'fixed' allowed k3s to work even without proper allow rules in place.