# general
w
root@prod-cp-04-5b902e16-8nq2d:/usr/local/bin# find / -type f -name "etcdctl"
/var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/750/fs/usr/local/bin/etcdctl
find: '/var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/20832': No such file or directory
/run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/082ec2b0b596181f7aed0501405467c89eaace474ea3d5465e27817d2579f2be/rootfs/usr/local/bin/etcdctl
what am i missing here?
v1.30.4+rke2r1
2024-10-18T01:32:28.10008028Z stderr F time="2024-10-18T01:32:28Z" level=error msg="Could not get ipvs family information from the kernel. It is possible that ipvs is not enabled in your kernel. Native loadbalancing will not work until this is fixed."
2024-10-18T01:32:28.102292846Z stderr F I1018 01:32:28.102086       1 proxier.go:646] "Dummy VS not created" scheduler="rr"
2024-10-18T01:32:28.102496177Z stderr F E1018 01:32:28.102283       1 server.go:558] "Error running ProxyServer" err="can't use the IPVS proxier: Ipvs not supported"
2024-10-18T01:32:28.102993505Z stderr F E1018 01:32:28.102512       1 run.go:74] "command failed" err="can't use the IPVS proxier: Ipvs not supported"
I did an upgrade and haven't been able to create nodes.
etcd-expose-metrics: false
      kube-proxy-arg:
        - proxy-mode=ipvs
        - ipvs-strict-arp=true
That is already set. I did diffs of my cilium config when upgrading
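Given that config, a quick sanity check for the "Ipvs not supported" error is whether the node's kernel can load the IPVS modules at all. A rough sketch to run on the node (note the module is named ip_vs, with an underscore):

# are the IPVS modules loaded, or at least shipped with this kernel?
lsmod | grep '^ip_vs'
find /lib/modules/$(uname -r) -name 'ip_vs*'
# try loading them by hand and watch for errors
modprobe ip_vs && modprobe ip_vs_rr && echo "ipvs modules loaded"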
ohhh
Running modprobe ip_vs failed with message: `modprobe: error while loading shared libraries: libzstd.so.1: cannot open shared object file: No such file or directory`, error: exit status 127"
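If modprobe itself can't load libzstd.so.1, the zstd runtime library is missing wherever that modprobe runs (host or the image calling it). A rough way to confirm and fix on an Ubuntu node (package names assumed to be Ubuntu's libzstd1/zstd):

# does modprobe's shared-library dependency on zstd resolve?
ldd "$(command -v modprobe)" | grep -i zstd
ldconfig -p | grep libzstd
# if it's missing, install the runtime library
apt-get update && apt-get install -y libzstd1 zstd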
i'm using the same base image, odd
strange
i'm the admin lol
zstd
isn't getting installed on new nodes it seems
ok answer is you can't use jammy
ubuntu-noble-24.04-cloudimg
upgraded to ubuntu-noble-24.04-cloudimg-2024-10-17
this was probably in release notes and i missed it
ok now it just takes > 10 min and deletes the vm
never upgrade. ever.
yep no errors
root@prod-workers-dlckv-vbm8j:/var/log/containers# tail -f *
==> cilium-2kgjb_kube-system_install-portmap-cni-plugin-1099be9951a529c80dc2b35de27660038a1206f8b107bc04ee13731f06c7a869.log <==
2024-10-18T03:04:28.054528923Z stdout F loopback is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.058854779Z stdout F macvlan is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.084236646Z stdout F copied /opt/cni/bin/portmap to /host/opt/cni/bin correctly
2024-10-18T03:04:28.088001364Z stdout F ptp is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.091797496Z stdout F sbr is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.096527088Z stdout F static is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.118189342Z stdout F copied /opt/cni/bin/tap to /host/opt/cni/bin correctly
2024-10-18T03:04:28.12302963Z stdout F tuning is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.12748478Z stdout F vlan is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.131794167Z stdout F vrf is in SKIP_CNI_BINARIES, skipping

==> kube-proxy-prod-workers-dlckv-vbm8j_kube-system_kube-proxy-9134d4449c11622d196ce5536b03dee26c565631abba4bb2bbf403771c7da810.log <==
2024-10-18T03:04:46.996851096Z stderr F I1018 03:04:46.996748       1 server.go:874] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
2024-10-18T03:04:46.998750378Z stderr F I1018 03:04:46.998475       1 config.go:101] "Starting endpoint slice config controller"
2024-10-18T03:04:46.998783916Z stderr F I1018 03:04:46.998487       1 config.go:192] "Starting service config controller"
2024-10-18T03:04:46.998793619Z stderr F I1018 03:04:46.998504       1 config.go:319] "Starting node config controller"
2024-10-18T03:04:46.998802524Z stderr F I1018 03:04:46.998518       1 shared_informer.go:313] Waiting for caches to sync for endpoint slice config
2024-10-18T03:04:46.998811421Z stderr F I1018 03:04:46.998523       1 shared_informer.go:313] Waiting for caches to sync for service config
2024-10-18T03:04:46.998819964Z stderr F I1018 03:04:46.998527       1 shared_informer.go:313] Waiting for caches to sync for node config
2024-10-18T03:04:47.101770411Z stderr F I1018 03:04:47.099535       1 shared_informer.go:320] Caches are synced for endpoint slice config
2024-10-18T03:04:47.101799013Z stderr F I1018 03:04:47.099617       1 shared_informer.go:320] Caches are synced for node config
2024-10-18T03:04:47.101808444Z stderr F I1018 03:04:47.099654       1 shared_informer.go:320] Caches are synced for service config

==> pushprox-kube-proxy-client-v4sx6_cattle-monitoring-system_pushprox-client-8c700c0ebaa337fc81f4eff85f6cc89e0e1b9fc42411619055e35e43f87309d3.log <==

==> rancher-monitoring-prometheus-node-exporter-zz488_cattle-monitoring-system_node-exporter-90596f4fd33069f83beaa9fcffcd1caa110ee962d904f2f53daba029cc848337.log <==
2024-10-18T03:04:24.413827964Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=thermal_zone
2024-10-18T03:04:24.413883731Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=time
2024-10-18T03:04:24.413897931Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=timex
2024-10-18T03:04:24.413909718Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=udp_queues
2024-10-18T03:04:24.413921251Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=uname
2024-10-18T03:04:24.413932868Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=vmstat
2024-10-18T03:04:24.413944448Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=xfs
2024-10-18T03:04:24.413956021Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=zfs
2024-10-18T03:04:24.414928911Z stderr F ts=2024-10-18T03:04:24.414Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9796
2024-10-18T03:04:24.414960089Z stderr F ts=2024-10-18T03:04:24.414Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9796

==> pushprox-kube-proxy-client-v4sx6_cattle-monitoring-system_pushprox-client-8c700c0ebaa337fc81f4eff85f6cc89e0e1b9fc42411619055e35e43f87309d3.log <==
2024-10-18T03:05:03.693417448Z stderr F ts=2024-10-18T03:05:03.692Z caller=main.go:269 level=info msg="URL and FQDN info" proxy_url=http://pushprox-kube-proxy-proxy.cattle-monitoring-system.svc:8080/ fqdn=192.168.10.211
2024-10-18T03:05:54.61196902Z stderr F ts=2024-10-18T03:05:54.611Z caller=main.go:232 level=info msg="Got scrape request" scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 url=http://192.168.10.211:10249/metrics
2024-10-18T03:05:54.629214776Z stderr F ts=2024-10-18T03:05:54.627Z caller=main.go:167 level=info scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 msg="Retrieved scrape response"
2024-10-18T03:05:54.640633631Z stderr F ts=2024-10-18T03:05:54.640Z caller=main.go:173 level=info scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 msg="Pushed scrape result"
^C
root@prod-workers-dlckv-vbm8j:/var/log/containers# tail -f *
==> cilium-2kgjb_kube-system_apply-sysctl-overwrites-fcb31ecf496f2f71830d395f7707697ceff695fc69d10d7df2a559f60a9395ac.log <==
2024-10-18T03:05:03.550333275Z stdout F sysctl config created/updated
2024-10-18T03:05:03.571730441Z stdout F systemd unit 'systemd-sysctl.service' restarted

==> cilium-2kgjb_kube-system_cilium-agent-03e82a96844e41a904e6729db3626568b1307635f980ce977f5d0a8eafd7f92c.log <==
2024-10-18T03:05:31.434749063Z stderr F time="2024-10-18T03:05:31Z" level=info msg="Updated link /sys/fs/bpf/cilium/devices/cilium_host/links/cil_to_host for program cil_to_host" subsys=datapath-loader
2024-10-18T03:05:31.434800072Z stderr F time="2024-10-18T03:05:31Z" level=info msg="Updated link /sys/fs/bpf/cilium/devices/cilium_host/links/cil_from_host for program cil_from_host" subsys=datapath-loader
2024-10-18T03:05:31.444961727Z stderr F time="2024-10-18T03:05:31Z" level=info msg="Updated link /sys/fs/bpf/cilium/devices/cilium_net/links/cil_to_host for program cil_to_host" subsys=datapath-loader
2024-10-18T03:05:31.455070324Z stderr F time="2024-10-18T03:05:31Z" level=info msg="Updated link /sys/fs/bpf/cilium/devices/ens192/links/cil_from_netdev for program cil_from_netdev" subsys=datapath-loader
2024-10-18T03:05:31.456573676Z stderr F time="2024-10-18T03:05:31Z" level=info msg="Reloaded endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=31 desiredPolicyRevision=31 endpointID=502 identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2024-10-18T03:05:33.630340864Z stderr F time="2024-10-18T03:05:33Z" level=info msg="Compiled new BPF template" BPFCompilationTime=3.952428938s file-path=/var/run/cilium/state/templates/8c74b47ac848a2eb2b4ad55b251f1021af01a11916691ddcec4081629c420ef1/bpf_lxc.o subsys=datapath-loader
2024-10-18T03:05:33.730543724Z stderr F time="2024-10-18T03:05:33Z" level=info msg="Program cil_from_container attached to device lxc_health using tcx" subsys=datapath-loader
2024-10-18T03:05:33.731071403Z stderr F time="2024-10-18T03:05:33Z" level=info msg="Reloaded endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=31 endpointID=760 identity=4 ipv4=10.42.144.97 ipv6= k8sPodName=/ subsys=endpoint
2024-10-18T03:05:33.817867344Z stderr F time="2024-10-18T03:05:33Z" level=info msg="Updated link /sys/fs/bpf/cilium/endpoints/760/links/cil_from_container for program cil_from_container" subsys=datapath-loader
2024-10-18T03:05:33.818347565Z stderr F time="2024-10-18T03:05:33Z" level=info msg="Reloaded endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=31 desiredPolicyRevision=31 endpointID=760 identity=4 ipv4=10.42.144.97 ipv6= k8sPodName=/ subsys=endpoint

==> cilium-2kgjb_kube-system_clean-cilium-state-267395c65e4f15544378f0c4d6c234aa9019c2890a99b0e2746b2b6630a1ff1a.log <==

==> cilium-2kgjb_kube-system_config-fbc2ab79bf0bd2878ad7df6291d7cfa2b282d2fcd3905ace9961cb13a7683f54.log <==
2024-10-18T03:04:56.840115888Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Establishing connection to apiserver" host="https://10.43.0.1:443" subsys=k8s-client
2024-10-18T03:04:56.861409014Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Connected to apiserver" subsys=k8s-client
2024-10-18T03:04:56.865321631Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Reading configuration from config-map:kube-system/cilium-config" configSource="config-map:kube-system/cilium-config" subsys=option-resolver
2024-10-18T03:04:56.87501038Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Got 139 config pairs from source" configSource="config-map:kube-system/cilium-config" subsys=option-resolver
2024-10-18T03:04:56.87506166Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Reading configuration from cilium-node-config:kube-system/" configSource="cilium-node-config:kube-system/" subsys=option-resolver
2024-10-18T03:04:56.895834868Z stderr F W1018 03:04:56.895396       1 warnings.go:70] cilium.io/v2alpha1 CiliumNodeConfig will be deprecated in cilium v1.16; use cilium.io/v2 CiliumNodeConfig
2024-10-18T03:04:56.896202622Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Got 0 config pairs from source" configSource="cilium-node-config:kube-system/" subsys=option-resolver
2024-10-18T03:04:56.944118135Z stderr F 2024/10/18 03:04:56 INFO Started duration=103.87983ms
2024-10-18T03:04:56.945060402Z stderr F 2024/10/18 03:04:56 INFO Stopping
2024-10-18T03:04:56.945094104Z stderr F 2024/10/18 03:04:56 INFO health.job-module-status-metrics (rev=2) module=health
==> cilium-2kgjb_kube-system_install-cni-binaries-73557773392eec7e0a007aee6a7a405f7e919c487ab41309c4436bfec1e2d1a0.log <==
2024-10-18T03:05:06.802529876Z stdout F Installing loopback to /host/opt/cni/bin/loopback ...
2024-10-18T03:05:06.83234181Z stdout F Wrote /host/opt/cni/bin/loopback
2024-10-18T03:05:06.838995105Z stdout F Installing cilium-cni to /host/opt/cni/bin/cilium-cni ...
2024-10-18T03:05:07.130704627Z stdout F Wrote /host/opt/cni/bin/cilium-cni

==> cilium-2kgjb_kube-system_install-portmap-cni-plugin-1099be9951a529c80dc2b35de27660038a1206f8b107bc04ee13731f06c7a869.log <==
2024-10-18T03:04:28.054528923Z stdout F loopback is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.058854779Z stdout F macvlan is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.084236646Z stdout F copied /opt/cni/bin/portmap to /host/opt/cni/bin correctly
2024-10-18T03:04:28.088001364Z stdout F ptp is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.091797496Z stdout F sbr is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.096527088Z stdout F static is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.118189342Z stdout F copied /opt/cni/bin/tap to /host/opt/cni/bin correctly
2024-10-18T03:04:28.12302963Z stdout F tuning is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.12748478Z stdout F vlan is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.131794167Z stdout F vrf is in SKIP_CNI_BINARIES, skipping

==> cilium-2kgjb_kube-system_mount-bpf-fs-d4bd6411ff67a6968b080c0e4be9ca562c1f14e091b3d2bfdd22e9eab1068261.log <==
2024-10-18T03:05:04.524331438Z stdout F bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)

==> cilium-2kgjb_kube-system_mount-cgroup-7161654cca4edb73efda9a69dd5341d667e16448570bd599e71e4af0b6931144.log <==
2024-10-18T03:05:02.793917307Z stderr F time="2024-10-18T03:05:02Z" level=info msg="Mounted cgroupv2 filesystem at /run/cilium/cgroupv2" subsys=cgroups

==> kube-proxy-prod-workers-dlckv-vbm8j_kube-system_kube-proxy-9134d4449c11622d196ce5536b03dee26c565631abba4bb2bbf403771c7da810.log <==
2024-10-18T03:04:46.996851096Z stderr F I1018 03:04:46.996748       1 server.go:874] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
2024-10-18T03:04:46.998750378Z stderr F I1018 03:04:46.998475       1 config.go:101] "Starting endpoint slice config controller"
2024-10-18T03:04:46.998783916Z stderr F I1018 03:04:46.998487       1 config.go:192] "Starting service config controller"
2024-10-18T03:04:46.998793619Z stderr F I1018 03:04:46.998504       1 config.go:319] "Starting node config controller"
2024-10-18T03:04:46.998802524Z stderr F I1018 03:04:46.998518       1 shared_informer.go:313] Waiting for caches to sync for endpoint slice config
2024-10-18T03:04:46.998811421Z stderr F I1018 03:04:46.998523       1 shared_informer.go:313] Waiting for caches to sync for service config
2024-10-18T03:04:46.998819964Z stderr F I1018 03:04:46.998527       1 shared_informer.go:313] Waiting for caches to sync for node config
2024-10-18T03:04:47.101770411Z stderr F I1018 03:04:47.099535       1 shared_informer.go:320] Caches are synced for endpoint slice config
2024-10-18T03:04:47.101799013Z stderr F I1018 03:04:47.099617       1 shared_informer.go:320] Caches are synced for node config
2024-10-18T03:04:47.101808444Z stderr F I1018 03:04:47.099654       1 shared_informer.go:320] Caches are synced for service config

==> pushprox-kube-proxy-client-v4sx6_cattle-monitoring-system_pushprox-client-8c700c0ebaa337fc81f4eff85f6cc89e0e1b9fc42411619055e35e43f87309d3.log <==
2024-10-18T03:05:03.693417448Z stderr F ts=2024-10-18T03:05:03.692Z caller=main.go:269 level=info msg="URL and FQDN info" proxy_url=http://pushprox-kube-proxy-proxy.cattle-monitoring-system.svc:8080/ fqdn=192.168.10.211
2024-10-18T03:05:54.61196902Z stderr F ts=2024-10-18T03:05:54.611Z caller=main.go:232 level=info msg="Got scrape request" scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 url=http://192.168.10.211:10249/metrics
2024-10-18T03:05:54.629214776Z stderr F ts=2024-10-18T03:05:54.627Z caller=main.go:167 level=info scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 msg="Retrieved scrape response"
2024-10-18T03:05:54.640633631Z stderr F ts=2024-10-18T03:05:54.640Z caller=main.go:173 level=info scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 msg="Pushed scrape result"

==> rancher-monitoring-prometheus-node-exporter-zz488_cattle-monitoring-system_node-exporter-90596f4fd33069f83beaa9fcffcd1caa110ee962d904f2f53daba029cc848337.log <==
2024-10-18T03:04:24.413827964Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=thermal_zone
2024-10-18T03:04:24.413883731Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=time
2024-10-18T03:04:24.413897931Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=timex
2024-10-18T03:04:24.413909718Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=udp_queues
2024-10-18T03:04:24.413921251Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=uname
2024-10-18T03:04:24.413932868Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=vmstat
2024-10-18T03:04:24.413944448Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=xfs
2024-10-18T03:04:24.413956021Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=zfs
2024-10-18T03:04:24.414928911Z stderr F ts=2024-10-18T03:04:24.414Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9796
2024-10-18T03:04:24.414960089Z stderr F ts=2024-10-18T03:04:24.414Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9796

==> cilium-2kgjb_kube-system_cilium-agent-03e82a96844e41a904e6729db3626568b1307635f980ce977f5d0a8eafd7f92c.log <==
2024-10-18T03:06:10.763798752Z stderr F time="2024-10-18T03:06:10Z" level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
2024-10-18T03:06:13.430771964Z stderr F time="2024-10-18T03:06:13Z" level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
2024-10-18T03:06:14.430380732Z stderr F time="2024-10-18T03:06:14Z" level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
/var/lib/rancher/rke2/bin/crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD
03e82a96844e4       0fba21db2995c       3 minutes ago       Running             cilium-agent        0                   8d91bff29b0fd       cilium-2kgjb
9134d4449c116       074a06fa52dfe       3 minutes ago       Running             kube-proxy          0                   39d28b02b1512       kube-proxy-prod-workers-dlckv-vbm8j
90596f4fd3306       72c9c20889862       4 minutes ago       Running             node-exporter       0                   8c23479a14510       rancher-monitoring-prometheus-node-exporter-zz488
8c700c0ebaa33       af3e37f58a5fe       4 minutes ago       Running             pushprox-client     0                   00782d135b523       pushprox-kube-proxy-client-v4sx6
journalctl -u rke2-agent
has no errors.
oo found one!
Oct 18 03:04:16 prod-workers-dlckv-vbm8j rke2[3312]: time="2024-10-18T03:04:16Z" level=error msg="Error encountered while importing /var/lib/rancher/rke2/agent/images/runtime-image.txt: failed to pull images from /var/lib/rancher/rke2/agent/images/runtime-image.txt: im>
c
we have never included etcdctl as a standalone binary with RKE2 in the same way we do kubectl or crictl. Were you really finding it inside one of the container images in the containerd image store and running it from the host!? That only works if the binary in the image is statically linked or linked against shared libraries that are available on your node. Which is NOT a safe assumption.
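(For what it's worth, a way to run etcdctl on an RKE2 server without copying a binary onto the host is to exec it inside the etcd static pod. A sketch, assuming the stock RKE2 pod name etcd-<nodename> and that the usual host cert paths are mounted into that pod:)

/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  -n kube-system exec etcd-$(hostname) -- \
  etcdctl --endpoints https://127.0.0.1:2379 \
    --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
    --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
    --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
    member list -w table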
what are you trying to do with that anyway
w
so, whenever nodes fail to start and churn through creates and deletes, in the past it's been etcd like 99% of the time. like ghost nodes that don't exist but are still in membership
etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379/ --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt member list
to check. if there are 3 nodes in rancher and >3 in etcd, then in the past i have had to remove the bad entry
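If the member list really does show a stale entry, removing it looks roughly like this (a sketch; the ectl wrapper is just shorthand for the same certs as above, and <MEMBER_ID> is the hex id from the first column of member list):

ectl() {
  etcdctl --endpoints https://127.0.0.1:2379 \
    --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
    --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
    --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key "$@"
}
ectl member list -w table       # note the hex ID of the ghost member
ectl member remove <MEMBER_ID>  # remove only after double-checking the ID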
anyway, i used to be able to ssh from rancher UI and run that
my real problem is that nodes can't be created.
also, it's not etcd 😄
rke2-agent
Oct 18 03:16:53 prod-workers-dd47cfcd4xc6c85-jqvt4 rke2[3170]: time="2024-10-18T03:16:53Z" level=info msg="rke2 agent is up and running"
Oct 18 03:16:53 prod-workers-dd47cfcd4xc6c85-jqvt4 systemd[1]: Started Rancher Kubernetes Engine v2 (agent).
i get there.
orting /var/lib/rancher/rke2/agent/images/runtime-image.txt: failed to pull images from /var/lib/rancher/rke2/agent/images/runtime-image.txt: image \"index.docker.io/rancher/rke2-runtime:v1.30.4-rke2r1\": not found"
hmm I can pull it just fine?
2024-10-18T03:29:12.012612806Z stderr F ts=2024-10-18T03:29:12.012Z caller=main.go:222 level=error msg="Error polling:" err="Post \"http://pushprox-kube-proxy-proxy.cattle-monitoring-system.svc:8080/poll\": dial tcp: lookup pushprox-kube-proxy-proxy.cattle-monitoring-sy
pretty sure you're the only committer Brandon
cilium shows
CrashLoopBackOff (back-off 40s restarting failed container=config pod=cilium-g8d74_kube-system(f3a60b8a-6185-472e-854b-87b90dcaf304)) | Last state: Terminated with 1: Error (Running
2024/10/18 04:02:29 INFO Starting
time="2024-10-18T04:02:29Z" level=info msg="Establishing connection to apiserver" host="https://10.43.0.1:443" subsys=k8s-client
time="2024-10-18T04:03:04Z" level=info msg="Establishing connection to apiserver" host="https://10.43.0.1:443" subsys=k8s-client
time="2024-10-18T04:03:34Z" level=error msg="Unable to contact k8s api-server" error="Get \"https://10.43.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.43.0.1:443: i/o timeout" ipAddr="https://10.43.0.1:443" subsys=k8s-client
2024/10/18 04:03:34 ERROR Start hook failed function="client.(*compositeClientset).onStart (k8s-client)" error="Get \"https://10.43.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.43.0.1:443: i/o timeout"
2024/10/18 04:03:34 ERROR Start failed error="Get \"https://10.43.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.43.0.1:443: i/o timeout" duration=1m5.011458077s
2024/10/18 04:03:34 INFO Stopping
Error: Build config failed: failed to start: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system": dial tcp 10.43.0.1:443: i/o timeout
), started: Fri, Oct 18 2024 12:02:29 am, finished: Fri, Oct 18 2024 12:03:34 am
c
You can ignore those cannot pull errors from the .txt files. They're bogus. I thought I'd addressed it but apparently not.
Is kube-proxy running on these nodes? Are you using the cilium kube proxy replacement?
w
let me check
yes kube-proxy is running
c
That’s what handles programming iptables on the node to manage connectivity between pods and cluster service endpoints, including the in-cluster apiserver endpoint at 10.43.0.1:443.
if kube-proxy is running, and the apiserver is up, then I’d check connectivity between the node in question and the apiserver nodes.
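A few quick checks from the affected node, as a sketch (substitute a real server node IP for <server-ip>; ipvsadm may need installing):

# can the node reach an apiserver directly?
curl -vk https://<server-ip>:6443/healthz
# is the in-cluster service VIP actually programmed on this node?
ipvsadm -Ln | grep -A2 '10.43.0.1'          # if kube-proxy is in ipvs mode
iptables-save -t nat | grep '10.43.0.1'     # if it's in iptables mode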
w
so, everything is the same aside from the updates, and all nodes that existed pre-upgrade are fine
That is my cilium config
i upgraded through each version of rancher one at a time, then k8s one at a time, per versioning policy, ensuring i was within min/max versions for each (via the matrix)
I diffed the cilium config each time as it can change in rancher ui. I only brought forward things like
null
=>
[]
or
{}
and of course image tags etc and new config
problem seems to be cilium failing to start
@clever-dusk-66120 kube-proxy is not running. i have kubelet push-prox
root@prod-workers-dd47cfcd4xc6c85-s6x4g:/var/log/containers# cat kube-proxy-prod-workers-dd47cfcd4xc6c85-s6x4g_kube-system_kube-proxy-95407588e6f3ca6fc13a7001c899a472f5a7ce53f0a86386b1831a0126ca89d2.log
2024-10-22T22:06:18.89262595Z stderr F I1022 22:06:18.892024       1 server.go:1062] "Successfully retrieved node IP(s)" IPs=["192.168.11.102"]
2024-10-22T22:06:19.004293519Z stderr F I1022 22:06:19.003852       1 server.go:659] "kube-proxy running in dual-stack mode" primary ipFamily="IPv4"
2024-10-22T22:06:19.00697528Z stderr F time="2024-10-22T22:06:19Z" level=warning msg="Running modprobe ip_vs failed with message: `modprobe: error while loading shared libraries: libzstd.so.1: cannot open shared object file: No such file or directory`, error: exit status 127"
2024-10-22T22:06:19.014651358Z stderr F time="2024-10-22T22:06:19Z" level=error msg="Could not get ipvs family information from the kernel. It is possible that ipvs is not enabled in your kernel. Native loadbalancing will not work until this is fixed."
2024-10-22T22:06:19.017635785Z stderr F I1022 22:06:19.017183       1 proxier.go:646] "Dummy VS not created" scheduler="rr"
2024-10-22T22:06:19.017676246Z stderr F E1022 22:06:19.017261       1 server.go:558] "Error running ProxyServer" err="can't use the IPVS proxier: Ipvs not supported"
2024-10-22T22:06:19.017708694Z stderr F E1022 22:06:19.017323       1 run.go:74] "command failed" err="can't use the IPVS proxier: Ipvs not supported"
dmesg | grep ipvs
is empty
this all used to work. So this is with
jammy
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"

# modprobe ipvs
modprobe: FATAL: Module ipvs not found in directory /lib/modules/5.15.0-105-generic
with noble,
[  137.441108] IPVS: Registered protocols (TCP, UDP, SCTP, AH, ESP)
[  137.441149] IPVS: Connection hash table configured (size=4096, memory=32Kbytes)
[  137.441258] IPVS: ipvs loaded.
[  137.458691] IPVS: [rr] scheduler registered.
[  137.459841] IPVS: starting estimator thread 0...
[  137.522567] IPVS: using max 3984 ests per chain, 199200 per kthread
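On a node where the modules did load, it's worth confirming kube-proxy actually programmed IPVS virtual servers; a quick check (assumes ipvsadm is installed):

lsmod | grep ip_vs      # modules loaded
ipvsadm -Ln | head      # virtual servers programmed by kube-proxy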
ok, so kube proxy is running here
Ok here we go 😄
==> apply-system-agent-upgrader-on-prod-workers-dlckv-9s5lf-w-wsrnh_cattle-system_upgrade-7192caf6072216187325ad879e63ff76eafafb5f17c73574d5e70532dc29af70.log <==
2024-10-22T22:41:10.059179477Z stderr F + CATTLE_AGENT_UNINSTALL_LOCAL=true
2024-10-22T22:41:10.05919205Z stderr F + export CATTLE_AGENT_BINARY_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2/rancher-system-agent
2024-10-22T22:41:10.059203981Z stderr F + CATTLE_AGENT_BINARY_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2/rancher-system-agent
2024-10-22T22:41:10.059223065Z stderr F + export CATTLE_AGENT_UNINSTALL_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2/rancher-system-agent-uninstall.sh
2024-10-22T22:41:10.059244439Z stderr F + CATTLE_AGENT_UNINSTALL_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2/rancher-system-agent-uninstall.sh
2024-10-22T22:41:10.059257132Z stderr F + '[' -s /host/etc/systemd/system/rancher-system-agent.env ']'
2024-10-22T22:41:10.059270263Z stderr F + chroot /host /var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2/install.sh
2024-10-22T22:41:10.084350706Z stderr F [FATAL]  You must select at least one role.
2024-10-22T22:41:10.08605501Z stderr F + cleanup
2024-10-22T22:41:10.086088531Z stderr F + rm -rf /host/var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2
c
did you manually put it in ipvs mode? RKE2 runs kube-proxy in iptables mode by default. If you switched it to ipvs mode, but then don’t have the correct kernel modules on your nodes, that would be a problem.
w
no I did not. I didn't change my cilium/kube proxy config. So apparently, the kernel in jammy is too old for the newer rke2/k8s? Unsure. But the issue was the upgrade plan. I for some reason was using a k3s upgrade plan. Probably was late and i blindly copied and pasted and didn't modify 😞
i deleted the plans and it looks like i progressed
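For anyone hitting the same thing: if you drive upgrades yourself with the system-upgrade-controller, an RKE2 server Plan looks roughly like this (version and selector are illustrative); the k3s variant points at rancher/k3s-upgrade and a +k3s version string, which an RKE2 node can't run:

kubectl apply -f - <<'EOF'
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: rke2-server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"]}
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.30.4+rke2r1
EOF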
c
> 2024-10-22T22:06:19.017676246Z stderr F E1022 22:06:19.017261       1 server.go:558] "Error running ProxyServer" err="can't use the IPVS proxier: Ipvs not supported"
> 2024-10-22T22:06:19.017708694Z stderr F E1022 22:06:19.017323       1 run.go:74] "command failed" err="can't use the IPVS proxier: Ipvs not supported"
Someone has intentionally put kube-proxy in IPVS mode. It does not use ipvs by default. Check your kube-proxy-args in the RKE2 config.
w
in cilium values.yaml?
c
idk, is that from the kube-proxy pod log, or a cilium pod?
w
that was kube proxy
c
ok, so that’d be in /etc/rancher/rke2/config.yaml or a file under /etc/rancher/rke2/config.yaml.d/
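A quick way to see where that flag is actually coming from on a node (paths as above; a sketch):

grep -rn 'kube-proxy-arg\|proxy-mode' /etc/rancher/rke2/config.yaml /etc/rancher/rke2/config.yaml.d/ 2>/dev/null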
w
machineGlobalConfig:
      cni: cilium
      disable-kube-proxy: false
      etcd-expose-metrics: false
      kube-proxy-arg:
        - proxy-mode=ipvs
        - ipvs-strict-arp=true
      kube-proxy-extra-mount:
        - /lib/modules:/lib/modules:ro
c
welp, like I said you put it in ipvs mode
w
it's been like that for a few years
can ipvs work with kube proxy? or only when no kube proxy?
c
apparently in previous years your nodes had the correct ipvs kernel modules loaded
it can, if you load the kernel modules
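If the plan is to stay on IPVS mode, the node image needs the modules present and loaded on every node; a minimal sketch for an Ubuntu cloud image (the module list is the usual set kube-proxy's IPVS proxier wants):

apt-get install -y ipvsadm ipset
cat >/etc/modules-load.d/ipvs.conf <<'EOF'
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
EOF
systemctl restart systemd-modules-load.service
lsmod | grep ip_vs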
w
yeah, i'm saying i haven't touched my images, i use the raw ubuntu cloud images. The upgrade controller plans were wrong, which broke stuff. But now i've deleted the plans and things seem to be working, except the node is not ready. let me check rke2 logs
journalctl -u rke2-server
looks fine
1 handler_proxy.go:93] no RequestInfo found in the context
2024-10-22T23:31:22.666351775Z stderr F E1022 23:31:22.666169       1 controller.go:146] Error updating APIService "v1.packages.operators.coreos.com" with err: failed to download v1.packages.operators.coreos.com: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
2024-10-22T23:31:22.666365696Z stderr F , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
2024-10-22T23:31:22.690183818Z stderr F E1022 23:31:22.689489       1 available_controller.go:460] v1.packages.operators.coreos.c
yeah
2024-10-22T23:33:05.509294141Z stderr F E1022 23:33:05.508966       1 handler_proxy.go:137] error resolving kasten-io/aggregatedapis-svc: no endpoints available for service "aggregatedapis-svc"
2024-10-22T23:33:05.509362827Z stderr F E1022 23:33:05.509141       1 handler_proxy.go:137] error resolving kube-system/rke2-metrics-server: no endpoints available for service "rke2-metrics-server"
2024-10-22T23:33:05.509373402Z stderr F E1022 23:33:05.508990       1 handler_proxy.go:137] error resolving cattle-monitoring-system/rancher-monitoring-prometheus-adapter: no endpoints available for service "rancher-monitoring-prometheus-adapter"
2024-10-22T23:33:05.509409316Z stderr F E1022 23:33:05.509226       1 handler_proxy.go:137] error resolving kasten-io/aggregatedapis-svc: no endpoints available for service "aggregatedapis-svc"
2024-10-22T23:33:05.509430565Z stderr F E1022 23:33:05.509307       1 handler_proxy.go:137] error resolving kasten-io/aggregatedapis-svc: no endpoints available for service "aggregatedapis-svc"
2024-10-22T23:33:05.509528621Z stderr F E1022 23:33:05.509398       1 handler_proxy.go:137] error resolving kasten-io/aggregatedapis-svc: no endpoints available for service "aggregatedapis-svc"
something is up with the api server. this is an api-server-only node
c
There is a CRD (v1.packages.operators.coreos.com) that has api aggregation enabled, and it can’t connect to the backing service to update the openapi spec. Are the kasten pods up?
w
no.
c
well that would be why there are no endpoints
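One way to see which aggregated APIs are unhappy and what backs them (a sketch; the kasten-io names come from the log above):

kubectl get apiservices | grep -v True                 # anything not Available
kubectl get apiservice v1.packages.operators.coreos.com -o jsonpath='{.spec.service}{"\n"}'
kubectl -n kasten-io get pods,endpoints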
w
this is an existing cluster.
c
that doesn’t mean the apiserver isn’t up, it just means it's trying to update some stuff and it can't
w
ahh ok
so what marks the node as ready in rancher?
c
the node or the cluster?
w
node. See second image above
k8s says node is ready, but rancher is waiting
c
is rancher-system-agent service running on the node? is the cattle-cluster-agent pod running in the cluster? are there errors in the logs for either?
start checking rancher component logs
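Concretely, those checks look something like this (a sketch; the label is assumed to match the stock cattle-cluster-agent deployment):

# on the node
systemctl status rancher-system-agent
journalctl -u rancher-system-agent -n 100 --no-pager
# in the cluster
kubectl -n cattle-system get pods -l app=cattle-cluster-agent
kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=50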
so that's a negative i guess
ok cattle cluster agent is running
it's applying crds
ok i think i see what's going on. on the old nodes from before the rke2/rancher/k8s upgrade, cilium is broken because ipvs suddenly stopped working. so i guess spin up new nodes and go from there
ok, next up from kube-sys
Failed calling webhook, failing closed rancher.cattle.io.clusters.management.cattle.io: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": no endpoints available for service "rancher-webhook"
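That usually just means the rancher-webhook pod isn't up or hasn't registered endpoints yet; to check, roughly (label assumed to be the chart default app=rancher-webhook):

kubectl -n cattle-system get deploy,pods,endpoints -l app=rancher-webhook
kubectl -n cattle-system logs deploy/rancher-webhook --tail=50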
yeah so it's totally RKE2/rancher here.
Using the exact same vm image (jammy cloud image) i used pre-upgrade of rke2/rancher, plus a restore of the etcd + cluster + k8s-version backup from before the rke2/rancher upgrade, and it just sits there
it's not even getting to the point of starting containers now. so this isn't a cilium thing i think
as i can make new clusters with the defaults (+vsphere cspi) and the new noble ubuntu image