
cuddly-egg-57762

10/14/2022, 10:45 AM
Hello good people. I'm using k3s v1.23.6+k3s1 with cilium:v1.12.2 on top of Rocky Linux 8.6, in a 3-node cluster (Proxmox VMs). Today I saw a very strange behavior: on the first node of the cluster, the k3s service restarted. From the service logs I can see there was only this error:
ott 14 10:24:13 node1 k3s[23642]: E1014 10:24:13.249106   23642 leaderelection.go:330] error retrieving resource lock kube-system/cloud-controller-manager: Get "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager?timeou
ott 14 10:24:13 node1 k3s[23642]: I1014 10:24:13.249252   23642 leaderelection.go:283] failed to renew lease kube-system/cloud-controller-manager: timed out waiting for the condition
ott 14 10:24:13 node1 k3s[23642]: F1014 10:24:13.249332   23642 controllermanager.go:237] leaderelection lost
ott 14 10:24:13 node1 k3s[23642]: E1014 10:24:13.249629   23642 runtime.go:76] Observed a panic: F1014 10:24:13.249332   23642 controllermanager.go:237] leaderelection lost
ott 14 10:24:13 node1 k3s[23642]: goroutine 22078 [running]:
ott 14 10:24:13 node1 k3s[23642]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3f70c00, 0xc0230f6340})
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/kubernetes/staging/src/k8s.io/apimachinery@v1.23.6-k3s1/pkg/util/runtime/runtime.go:74 +0x7d
ott 14 10:24:13 node1 k3s[23642]: k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1})
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/kubernetes/staging/src/k8s.io/apimachinery@v1.23.6-k3s1/pkg/util/runtime/runtime.go:48 +0x75
ott 14 10:24:13 node1 k3s[23642]: panic({0x3f70c00, 0xc0230f6340})
ott 14 10:24:13 node1 k3s[23642]:         /usr/local/go/src/runtime/panic.go:1038 +0x215
ott 14 10:24:13 node1 k3s[23642]: k8s.io/klog/v2.(*loggingT).output(0x7c06200, 0x3, 0x0, 0xc02be58700, 0x0, {0x6018774, 0x0}, 0xc00eabd1c0, 0x0)
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/klog/v2@v2.30.0-k3s1/klog.go:982 +0x625
ott 14 10:24:13 node1 k3s[23642]: k8s.io/klog/v2.(*loggingT).printf(0x0, 0x0, 0x0, {0x0, 0x0}, {0x4811943, 0x13}, {0x0, 0x0, 0x0})
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/klog/v2@v2.30.0-k3s1/klog.go:753 +0x1c5
ott 14 10:24:13 node1 k3s[23642]: k8s.io/klog/v2.Fatalf(...)
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/klog/v2@v2.30.0-k3s1/klog.go:1513
ott 14 10:24:13 node1 k3s[23642]: k8s.io/cloud-provider/app.Run.func3()
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/kubernetes/staging/src/k8s.io/cloud-provider@v1.23.6-k3s1/app/controllermanager.go:237 +0x55
ott 14 10:24:13 node1 k3s[23642]: k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1()
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/kubernetes/staging/src/k8s.io/client-go@v1.23.6-k3s1/tools/leaderelection/leaderelection.go:203 +0x1f
ott 14 10:24:13 node1 k3s[23642]: k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc00991f440, {0x51d17b0, 0xc00012a008})
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/kubernetes/staging/src/k8s.io/client-go@v1.23.6-k3s1/tools/leaderelection/leaderelection.go:213 +0x189
ott 14 10:24:13 node1 k3s[23642]: k8s.io/client-go/tools/leaderelection.RunOrDie({0x51d17b0, 0xc00012a008}, {{0x5223e48, 0xc00eba1cc0}, 0x37e11d600, 0x2540be400, 0x77359400, {0xc0076699e0, 0x4bacd10, 0x0}, ...})
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/kubernetes/staging/src/k8s.io/client-go@v1.23.6-k3s1/tools/leaderelection/leaderelection.go:226 +0x94
ott 14 10:24:13 node1 k3s[23642]: k8s.io/cloud-provider/app.leaderElectAndRun(0xc0072e29a8, {0xc00c4fd410, 0x2a}, 0xc007319308, {0x47dabc2, 0x6}, {0x4829eef, 0x18}, {0xc0076699e0, 0x4bacd10, ...})
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/kubernetes/staging/src/k8s.io/cloud-provider@v1.23.6-k3s1/app/controllermanager.go:516 +0x2c5
ott 14 10:24:13 node1 k3s[23642]: created by k8s.io/cloud-provider/app.Run
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/kubernetes/staging/src/k8s.io/cloud-provider@v1.23.6-k3s1/app/controllermanager.go:220 +0x7a5
ott 14 10:24:13 node1 k3s[23642]: panic: unreachable
ott 14 10:24:13 node1 k3s[23642]: goroutine 22078 [running]:
ott 14 10:24:13 node1 k3s[23642]: k8s.io/cloud-provider/app.leaderElectAndRun(0xc0072e29a8, {0xc00c4fd410, 0x2a}, 0xc007319308, {0x47dabc2, 0x6}, {0x4829eef, 0x18}, {0xc0076699e0, 0x4bacd10, ...})
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/kubernetes/staging/src/k8s.io/cloud-provider@v1.23.6-k3s1/app/controllermanager.go:526 +0x2d8
ott 14 10:24:13 node1 k3s[23642]: created by k8s.io/cloud-provider/app.Run
ott 14 10:24:13 node1 k3s[23642]:         /go/pkg/mod/github.com/k3s-io/kubernetes/staging/src/k8s.io/cloud-provider@v1.23.6-k3s1/app/controllermanager.go:220 +0x7a5
ott 14 10:24:17 node1 systemd[1]: k3s.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
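For context, this fatal exit is how client-go leader election is designed to behave: the embedded cloud-controller-manager could not renew its lease against https://127.0.0.1:6444 within the renew deadline, so the OnStoppedLeading callback fired and called klog.Fatalf, taking the whole k3s process down (that is the controllermanager.go:237 frame in the trace). Below is a minimal sketch of that wiring, not the actual k3s source; the controller body and client setup are placeholders, and the 15s/10s/2s timings are the usual defaults, which also match the durations encoded in the RunOrDie frame above (0x37e11d600 ns = 15s, 0x2540be400 ns = 10s, 0x77359400 ns = 2s).

// Minimal sketch (not the actual k3s code) of client-go leader election:
// losing the lease calls OnStoppedLeading, which, as in the
// cloud-controller-manager, exits the whole process via klog.Fatalf.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig() // placeholder: k3s builds its client config differently
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	hostname, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "cloud-controller-manager",
			Namespace: "kube-system",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: hostname},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a lease is considered valid
		RenewDeadline: 10 * time.Second, // renewal must succeed within this window
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// placeholder for the controller loops that run while leading
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// This is the path seen in the stack trace: a lost lease is
				// treated as fatal, so systemd sees the process exit.
				klog.Fatalf("leaderelection lost")
			},
		},
	})
}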
Just out of curiosity I also checked the kernel messages, and I saw this strange stack trace related to eBPF:
[ 6381.435296] WARNING: CPU: 3 PID: 66903 at mm/vmalloc.c:330 vmalloc_to_page+0x219/0x220
[ 6381.436034] Modules linked in: tcp_diag inet_diag fuse sunrpc cls_bpf sch_ingress xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 xt_nat xt_set xt_CT veth ip_set_hash_ip xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_filter ip6table_raw ip6table_mangle ip6_tables iptable_filter iptable_raw iptable_mangle ip6t_MASQUERADE ipt_MASQUERADE xt_conntrack xt_comment nft_counter xt_mark nft_compat iptable_nat ip_tables dm_thin_pool dm_persistent_data dm_bio_prison dm_snapshot dm_bufio br_netfilter bridge stp llc overlay nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink bochs_drm drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm joydev pcspkr virtio_balloon i2c_piix4 xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ata_generic ata_piix libata virtio_net serio_raw net_failover virtio_console failover virtio_scsi dm_mirror dm_region_hash dm_log dm_mod
[ 6381.436234]  [last unloaded: nft_fib]
[ 6381.442676] Red Hat flags: eBPF/cls eBPF/cgroup
[ 6381.443442] CPU: 3 PID: 66903 Comm: tc Kdump: loaded Not tainted 4.18.0-372.9.1.el8.x86_64 #1
[ 6381.444220] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 6381.445856] RIP: 0010:vmalloc_to_page+0x219/0x220
[ 6381.446707] Code: 50 ec 00 c3 48 81 e7 00 00 00 c0 e9 1a ff ff ff 48 8b 3d 32 bf ed 00 48 81 e7 00 f0 ff ff 48 89 fa eb 8d 0f 0b e9 15 fe ff ff <0f> 0b 31 c0 c3 66 90 0f 1f 44 00 00 e8 d6 fd ff ff 48 2b 05 ff 4f
[ 6381.448372] RSP: 0018:ffffa3a70de3bca0 EFLAGS: 00010293
[ 6381.449215] RAX: 0000000000000063 RBX: fffff440c41877c0 RCX: 0000000000000000
[ 6381.450091] RDX: 0000000000000000 RSI: ffffffffc0c008bb RDI: 0000000000000000
[ 6381.450970] RBP: ffffffffc0bff8bb R08: 0000000000000000 R09: 0000000000000001
[ 6381.451854] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 6381.452774] R13: ffffa3a70de3bcef R14: ffffffffc0c008bb R15: 0000000000000001
[ 6381.453647] FS:  00007f61cc987740(0000) GS:ffff96fe73d80000(0000) knlGS:0000000000000000
[ 6381.454571] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6381.455437] CR2: 000055c5f2725df0 CR3: 0000000106106000 CR4: 00000000000006e0
[ 6381.456367] Call Trace:
[ 6381.457321]  __text_poke+0x203/0x260
[ 6381.458295]  text_poke_bp_batch+0x67/0x160
[ 6381.459135]  ? bpf_prog_c13c61c429dc020a_handle_policy+0x25c3/0x2d08
[ 6381.460008]  text_poke_bp+0x3a/0x54
[ 6381.460830]  ? bpf_prog_c13c61c429dc020a_handle_policy+0x25c3/0x2d08
[ 6381.461658]  __bpf_arch_text_poke+0x19e/0x1b0
[ 6381.462551]  prog_array_map_poke_run+0xe2/0x1d0
[ 6381.463447]  bpf_fd_array_map_update_elem+0x8b/0xe0
[ 6381.464299]  map_update_elem+0x1cf/0x1f0
[ 6381.465148]  __do_sys_bpf+0x92b/0x1da0
[ 6381.465987]  ? audit_log_exit+0x2b2/0xd60
[ 6381.466788]  do_syscall_64+0x5b/0x1a0
[ 6381.467656]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 6381.468496] RIP: 0033:0x7f61ccac073d
[ 6381.469318] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 23 37 0d 00 f7 d8 64 89 01 48
[ 6381.471074] RSP: 002b:00007fff72060128 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
[ 6381.471970] RAX: ffffffffffffffda RBX: 00007fff72062530 RCX: 00007f61ccac073d
[ 6381.472845] RDX: 0000000000000080 RSI: 00007fff72060130 RDI: 0000000000000002
[ 6381.473702] RBP: 0000564de994f1f0 R08: 00007fff7206020c R09: 0000000000000008
[ 6381.474554] R10: 0000000000000000 R11: 0000000000000246 R12: 0000564de9901a80
[ 6381.475530] R13: 0000564de994f700 R14: 00007fff72060260 R15: 0000564de9991938
[ 6381.476413] ---[ end trace b293578addd05c82 ]---
This is the first time in months that I've seen this problem: do you know why it is happening? Could it be caused by the kernel version (4.18.0-372.9.1.el8.x86_64)?
I see the same behavior in a single-node k3s installation, but the eBPF error is not present there, so the two things don't seem to be related.

abundant-fish-48120

10/14/2022, 3:24 PM
looks like a simple network issue, but I'm not familiar with Proxmox
a connection timeout could be anything from a routing table issue to an iptables issue
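One quick way to tell a local network problem apart from a slow apiserver is to repeatedly hit the endpoint that timed out in the log and watch the latency. A rough Go sketch follows; it deliberately skips TLS verification and auth (you will likely get a 401, which is fine, the point is only how long the answer takes), and 6444 is the k3s supervisor/apiserver port seen in the error above.

// Rough latency probe for the endpoint that timed out in the log.
// Skips TLS verification and auth on purpose: the goal is only to see
// whether the local listener answers quickly, not to make an authorized call.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	url := "https://127.0.0.1:6444/healthz" // k3s supervisor/apiserver port from the log
	for i := 0; i < 5; i++ {
		start := time.Now()
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("request %d failed after %v: %v\n", i, time.Since(start), err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("request %d: %s in %v\n", i, resp.Status, time.Since(start))
		time.Sleep(time.Second)
	}
}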

cuddly-egg-57762

10/17/2022, 7:09 AM
it looks like the problem was caused by the massive I/O happening on the disks: all the VMs were sharing a single datastore, which was overloaded with I/O. Checking the k3s.service log I saw a lot of warnings about etcd replies being too slow. Giving etcd a dedicated disk seems to have solved the problem. Thanks a lot for your help @abundant-fish-48120
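For anyone hitting the same symptoms: those slow-reply warnings are typically etcd complaining about fsync latency on its data directory (messages like "slow fdatasync" or "apply request took too long"), and etcd is very sensitive to that. A crude way to sanity-check a datastore is to time synced writes on the disk that backs the etcd data dir. Rough sketch below, assuming the default k3s embedded-etcd location under /var/lib/rancher/k3s/server/db; adjust the path for your setup.

// Crude fsync-latency check for the disk backing the etcd data dir.
// Sustained high write+fsync latency here is what produces etcd's
// "slow fdatasync" / "apply request took too long" warnings.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	// Default k3s embedded-etcd location; change to wherever your datastore lives.
	dir := "/var/lib/rancher/k3s/server/db"
	path := filepath.Join(dir, "fsync-test.tmp")
	defer os.Remove(path)

	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0600)
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer f.Close()

	buf := make([]byte, 8*1024) // roughly a WAL-entry-sized write
	var worst time.Duration
	for i := 0; i < 100; i++ {
		start := time.Now()
		if _, err := f.Write(buf); err != nil {
			fmt.Println("write failed:", err)
			return
		}
		if err := f.Sync(); err != nil { // force the data to disk, like etcd's WAL fsync
			fmt.Println("fsync failed:", err)
			return
		}
		if d := time.Since(start); d > worst {
			worst = d
		}
	}
	fmt.Println("worst write+fsync latency over 100 iterations:", worst)
}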

abundant-fish-48120

10/18/2022, 4:22 PM
Disk issues are always fun. Glad you solved it.