chilly-telephone-51989

01/05/2023, 10:14 AM
Hello, I've been facing an issue very recently. A k3s cluster with one master and 2 worker nodes has stopped responding. I restarted k3s and here is the log:
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-01-05 10:01:20 UTC; 9min ago
       Docs: https://k3s.io
    Process: 963939 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
    Process: 963941 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 963942 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 963943 (k3s-server)
      Tasks: 88
     Memory: 679.2M
        CPU: 46.703s
     CGroup: /system.slice/k3s.service
             ├─  1626 /var/lib/rancher/k3s/data/577968fa3d58539cc4265245941b7be688833e6bf5ad7869fa2afe02f15f1cd2/bin/containerd-shim-runc-v2 -namespace k8s.io -id 2a9aca2d3>
             ├─  1809 /var/lib/rancher/k3s/data/577968fa3d58539cc4265245941b7be688833e6bf5ad7869fa2afe02f15f1cd2/bin/containerd-shim-runc-v2 -namespace k8s.io -id 4034e2e93>
             ├─  2269 /var/lib/rancher/k3s/data/577968fa3d58539cc4265245941b7be688833e6bf5ad7869fa2afe02f15f1cd2/bin/containerd-shim-runc-v2 -namespace k8s.io -id 31f1caa57>
             ├─963482 /var/lib/rancher/k3s/data/577968fa3d58539cc4265245941b7be688833e6bf5ad7869fa2afe02f15f1cd2/bin/containerd-shim-runc-v2 -namespace k8s.io -id 76b5f8348>
             ├─963943 "/usr/local/bin/k3s server"
             └─963959 containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib>

Jan 05 10:01:34 ip-172-31-46-55 k3s[963943]: I0105 10:01:34.384632  963943 shared_informer.go:262] Caches are synced for garbage collector
Jan 05 10:01:34 ip-172-31-46-55 k3s[963943]: I0105 10:01:34.384666  963943 garbagecollector.go:158] Garbage collector: all resource monitors have synced. Proceeding to coll>
Jan 05 10:01:34 ip-172-31-46-55 k3s[963943]: I0105 10:01:34.454332  963943 shared_informer.go:262] Caches are synced for garbage collector
Jan 05 10:06:31 ip-172-31-46-55 k3s[963943]: I0105 10:06:31.679514  963943 trace.go:205] Trace[110892778]: "Get" url:/api/v1/namespaces/xplorie/pods/gateway-86c6cc8bf4-fjnr>
Jan 05 10:06:31 ip-172-31-46-55 k3s[963943]: Trace[110892778]: ---"Writing http response done" 6201ms (10:06:31.679)
Jan 05 10:06:31 ip-172-31-46-55 k3s[963943]: Trace[110892778]: [6.204322629s] [6.204322629s] END
Jan 05 10:07:27 ip-172-31-46-55 k3s[963943]: I0105 10:07:27.025536  963943 trace.go:205] Trace[1243742719]: "Get" url:/api/v1/namespaces/xplorie/pods/web-7df799b896-kmjwl/l>
Jan 05 10:07:27 ip-172-31-46-55 k3s[963943]: Trace[1243742719]: ---"Writing http response done" 35215ms (10:07:27.025)
Jan 05 10:07:27 ip-172-31-46-55 k3s[963943]: Trace[1243742719]: [35.217361472s] [35.217361472s] END
Jan 05 10:09:32 ip-172-31-46-55 k3s[963943]: I0105 10:09:32.845634  963943 log.go:195] http: TLS handshake error from 127.0.0.1:39796: read tcp 127.0.0.1:10250->127.0.0.1:39796: read: connection reset by peer
These servers are AWS EC2 machines.
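Given the 6–35 second "Writing http response" traces and the TLS handshake error on the kubelet port (10250), a quick sanity check of node health and inter-node connectivity might look like the sketch below (it assumes kubectl access on the master, nc installed, and the metrics-server that k3s bundles by default; the worker IPs are the ones listed further down the thread):

$ kubectl get nodes -o wide
$ kubectl top nodes                  # rough check for CPU/memory pressure; needs metrics-server
$ nc -zvw3 172.31.34.0 10250         # kubelet port on Node1, run from the master
$ nc -zvw3 172.31.41.97 10250        # kubelet port on Node2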
We have opened the ports mentioned in the TLS error (10250 and 39796). That error is gone, but the cluster is still unresponsive: performing a curl just times out. Pods seem to be running fine and the ingress is also working fine, but every request times out, even when tried using the private/local IP.
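To narrow down where the timeout happens, it can help to curl the ingress port on each node's IP directly, with an explicit timeout, and to hit a pod IP while bypassing the ingress entirely. A rough sketch (node IPs as listed below; <pod-ip> and <container-port> are placeholders taken from the pod listing):

$ curl -m 5 -v http://172.31.46.55/                # master
$ curl -m 5 -v http://172.31.34.0/                 # Node1
$ curl -m 5 -v http://172.31.41.97/                # Node2
$ kubectl -n xplorie get pods -o wide              # shows pod IPs
$ curl -m 5 -v http://<pod-ip>:<container-port>/   # bypasses the ingress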
By opening the ports the problem is partly solved: we are able to reach the cluster from Node1 and Node2, on both the public and private IPs. However, we are not able to reach it via the private or public IP of the master node, and the load balancer IP always times out. Any idea what could be wrong in this case? Here is the ingress detail:
$ kn get ingress
NAME      CLASS    HOSTS   ADDRESS                                 PORTS   AGE
ingress   <none>   *       172.31.34.0,172.31.41.97,172.31.46.55   80      111d
Master: 172.31.46.55, Node1: 172.31.34.0, Node2: 172.31.41.97. It runs fine on Node1 & Node2 now, but times out on the master.
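Since the ingress address lists all three nodes but only the master times out, one thing worth checking is whether the per-node load-balancer pod on the master is healthy and whether its host-port NAT rules exist there. A sketch, assuming the default k3s Traefik ingress with ServiceLB (klipper-lb); the svclb pod name is whatever the listing shows:

$ kubectl -n kube-system get svc traefik
$ kubectl -n kube-system get pods -o wide | grep svclb   # one svclb pod per node; find the one on 172.31.46.55
$ kubectl -n kube-system logs <svclb-pod-on-master>      # placeholder: use the pod name from the line above
$ sudo iptables-save | grep -E 'dpt:(80|443)'            # run on the master; host-port DNAT rules for 80/443

It may also be worth comparing the AWS security group attached to the master instance with the one on the workers, since the behaviour differs only by node.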