# general
w
root@prod-cp-04-5b902e16-8nq2d:/usr/local/bin# find / -type f -name "etcdctl"
/var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/750/fs/usr/local/bin/etcdctl
find: '/var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/20832': No such file or directory
/run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/082ec2b0b596181f7aed0501405467c89eaace474ea3d5465e27817d2579f2be/rootfs/usr/local/bin/etcdctl
what am i missing here?
v1.30.4+rke2r1
2024-10-18T01:32:28.10008028Z stderr F time="2024-10-18T01:32:28Z" level=error msg="Could not get ipvs family information from the kernel. It is possible that ipvs is not enabled in your kernel. Native loadbalancing will not work until this is fixed."
2024-10-18T01:32:28.102292846Z stderr F I1018 01:32:28.102086       1 proxier.go:646] "Dummy VS not created" scheduler="rr"
2024-10-18T01:32:28.102496177Z stderr F E1018 01:32:28.102283       1 server.go:558] "Error running ProxyServer" err="can't use the IPVS proxier: Ipvs not supported"
2024-10-18T01:32:28.102993505Z stderr F E1018 01:32:28.102512       1 run.go:74] "command failed" err="can't use the IPVS proxier: Ipvs not supported"
I did an upgrade and haven't been able to create nodes.
etcd-expose-metrics: false
      kube-proxy-arg:
        - proxy-mode=ipvs
        - ipvs-strict-arp=true
That is already set. I did diffs of my cilium config when upgrading
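Given that config, a quick sanity check for the "Ipvs not supported" error is whether the node's kernel can load the IPVS modules at all. A rough sketch to run on the node (note the module is named ip_vs, with an underscore):

# are the IPVS modules loaded, or at least shipped with this kernel?
lsmod | grep '^ip_vs'
find /lib/modules/$(uname -r) -name 'ip_vs*'
# try loading them by hand and watch for errors
modprobe ip_vs && modprobe ip_vs_rr && echo "ipvs modules loaded"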
ohhh
Running modprobe ip_vs failed with message: `modprobe: error while loading shared libraries: libzstd.so.1: cannot open shared object file: No such file or directory`, error: exit status 127"
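If modprobe itself can't load libzstd.so.1, the zstd runtime library is missing wherever that modprobe runs (host or the image calling it). A rough way to confirm and fix on an Ubuntu node (package names assumed to be Ubuntu's libzstd1/zstd):

# does modprobe's shared-library dependency on zstd resolve?
ldd "$(command -v modprobe)" | grep -i zstd
ldconfig -p | grep libzstd
# if it's missing, install the runtime library
apt-get update && apt-get install -y libzstd1 zstd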
i'm using the same base image, odd
strange
i'm the admin lol
zstd
isn't getting installed on new nodes it seems
ok answer is you can't use jammy
ubuntu-noble-24.04-cloudimg
upgraded to ubuntu-noble-24.04-cloudimg-2024-10-17
this was probably in release notes and i missed it
ok now it just takes > 10 min and deletes the vm
never upgrade. ever.
yep no errors
root@prod-workers-dlckv-vbm8j:/var/log/containers# tail -f *
==> cilium-2kgjb_kube-system_install-portmap-cni-plugin-1099be9951a529c80dc2b35de27660038a1206f8b107bc04ee13731f06c7a869.log <==
2024-10-18T03:04:28.054528923Z stdout F loopback is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.058854779Z stdout F macvlan is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.084236646Z stdout F copied /opt/cni/bin/portmap to /host/opt/cni/bin correctly
2024-10-18T03:04:28.088001364Z stdout F ptp is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.091797496Z stdout F sbr is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.096527088Z stdout F static is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.118189342Z stdout F copied /opt/cni/bin/tap to /host/opt/cni/bin correctly
2024-10-18T03:04:28.12302963Z stdout F tuning is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.12748478Z stdout F vlan is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.131794167Z stdout F vrf is in SKIP_CNI_BINARIES, skipping

==> kube-proxy-prod-workers-dlckv-vbm8j_kube-system_kube-proxy-9134d4449c11622d196ce5536b03dee26c565631abba4bb2bbf403771c7da810.log <==
2024-10-18T03:04:46.996851096Z stderr F I1018 03:04:46.996748       1 server.go:874] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
2024-10-18T03:04:46.998750378Z stderr F I1018 03:04:46.998475       1 config.go:101] "Starting endpoint slice config controller"
2024-10-18T03:04:46.998783916Z stderr F I1018 03:04:46.998487       1 config.go:192] "Starting service config controller"
2024-10-18T03:04:46.998793619Z stderr F I1018 03:04:46.998504       1 config.go:319] "Starting node config controller"
2024-10-18T03:04:46.998802524Z stderr F I1018 03:04:46.998518       1 shared_informer.go:313] Waiting for caches to sync for endpoint slice config
2024-10-18T03:04:46.998811421Z stderr F I1018 03:04:46.998523       1 shared_informer.go:313] Waiting for caches to sync for service config
2024-10-18T03:04:46.998819964Z stderr F I1018 03:04:46.998527       1 shared_informer.go:313] Waiting for caches to sync for node config
2024-10-18T03:04:47.101770411Z stderr F I1018 03:04:47.099535       1 shared_informer.go:320] Caches are synced for endpoint slice config
2024-10-18T03:04:47.101799013Z stderr F I1018 03:04:47.099617       1 shared_informer.go:320] Caches are synced for node config
2024-10-18T03:04:47.101808444Z stderr F I1018 03:04:47.099654       1 shared_informer.go:320] Caches are synced for service config

==> pushprox-kube-proxy-client-v4sx6_cattle-monitoring-system_pushprox-client-8c700c0ebaa337fc81f4eff85f6cc89e0e1b9fc42411619055e35e43f87309d3.log <==

==> rancher-monitoring-prometheus-node-exporter-zz488_cattle-monitoring-system_node-exporter-90596f4fd33069f83beaa9fcffcd1caa110ee962d904f2f53daba029cc848337.log <==
2024-10-18T03:04:24.413827964Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=thermal_zone
2024-10-18T03:04:24.413883731Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=time
2024-10-18T03:04:24.413897931Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=timex
2024-10-18T03:04:24.413909718Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=udp_queues
2024-10-18T03:04:24.413921251Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=uname
2024-10-18T03:04:24.413932868Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=vmstat
2024-10-18T03:04:24.413944448Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=xfs
2024-10-18T03:04:24.413956021Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=zfs
2024-10-18T03:04:24.414928911Z stderr F ts=2024-10-18T03:04:24.414Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9796
2024-10-18T03:04:24.414960089Z stderr F ts=2024-10-18T03:04:24.414Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9796

==> pushprox-kube-proxy-client-v4sx6_cattle-monitoring-system_pushprox-client-8c700c0ebaa337fc81f4eff85f6cc89e0e1b9fc42411619055e35e43f87309d3.log <==
2024-10-18T03:05:03.693417448Z stderr F ts=2024-10-18T03:05:03.692Z caller=main.go:269 level=info msg="URL and FQDN info" proxy_url=http://pushprox-kube-proxy-proxy.cattle-monitoring-system.svc:8080/ fqdn=192.168.10.211
2024-10-18T03:05:54.61196902Z stderr F ts=2024-10-18T03:05:54.611Z caller=main.go:232 level=info msg="Got scrape request" scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 url=http://192.168.10.211:10249/metrics
2024-10-18T03:05:54.629214776Z stderr F ts=2024-10-18T03:05:54.627Z caller=main.go:167 level=info scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 msg="Retrieved scrape response"
2024-10-18T03:05:54.640633631Z stderr F ts=2024-10-18T03:05:54.640Z caller=main.go:173 level=info scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 msg="Pushed scrape result"
^C
root@prod-workers-dlckv-vbm8j:/var/log/containers# tail -f *
==> cilium-2kgjb_kube-system_apply-sysctl-overwrites-fcb31ecf496f2f71830d395f7707697ceff695fc69d10d7df2a559f60a9395ac.log <==
2024-10-18T03:05:03.550333275Z stdout F sysctl config created/updated
2024-10-18T03:05:03.571730441Z stdout F systemd unit 'systemd-sysctl.service' restarted

==> cilium-2kgjb_kube-system_cilium-agent-03e82a96844e41a904e6729db3626568b1307635f980ce977f5d0a8eafd7f92c.log <==
2024-10-18T03:05:31.434749063Z stderr F time="2024-10-18T03:05:31Z" level=info msg="Updated link /sys/fs/bpf/cilium/devices/cilium_host/links/cil_to_host for program cil_to_host" subsys=datapath-loader
2024-10-18T03:05:31.434800072Z stderr F time="2024-10-18T03:05:31Z" level=info msg="Updated link /sys/fs/bpf/cilium/devices/cilium_host/links/cil_from_host for program cil_from_host" subsys=datapath-loader
2024-10-18T03:05:31.444961727Z stderr F time="2024-10-18T03:05:31Z" level=info msg="Updated link /sys/fs/bpf/cilium/devices/cilium_net/links/cil_to_host for program cil_to_host" subsys=datapath-loader
2024-10-18T03:05:31.455070324Z stderr F time="2024-10-18T03:05:31Z" level=info msg="Updated link /sys/fs/bpf/cilium/devices/ens192/links/cil_from_netdev for program cil_from_netdev" subsys=datapath-loader
2024-10-18T03:05:31.456573676Z stderr F time="2024-10-18T03:05:31Z" level=info msg="Reloaded endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=31 desiredPolicyRevision=31 endpointID=502 identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2024-10-18T03:05:33.630340864Z stderr F time="2024-10-18T03:05:33Z" level=info msg="Compiled new BPF template" BPFCompilationTime=3.952428938s file-path=/var/run/cilium/state/templates/8c74b47ac848a2eb2b4ad55b251f1021af01a11916691ddcec4081629c420ef1/bpf_lxc.o subsys=datapath-loader
2024-10-18T03:05:33.730543724Z stderr F time="2024-10-18T03:05:33Z" level=info msg="Program cil_from_container attached to device lxc_health using tcx" subsys=datapath-loader
2024-10-18T03:05:33.731071403Z stderr F time="2024-10-18T03:05:33Z" level=info msg="Reloaded endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=31 endpointID=760 identity=4 ipv4=10.42.144.97 ipv6= k8sPodName=/ subsys=endpoint
2024-10-18T03:05:33.817867344Z stderr F time="2024-10-18T03:05:33Z" level=info msg="Updated link /sys/fs/bpf/cilium/endpoints/760/links/cil_from_container for program cil_from_container" subsys=datapath-loader
2024-10-18T03:05:33.818347565Z stderr F time="2024-10-18T03:05:33Z" level=info msg="Reloaded endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=31 desiredPolicyRevision=31 endpointID=760 identity=4 ipv4=10.42.144.97 ipv6= k8sPodName=/ subsys=endpoint

==> cilium-2kgjb_kube-system_clean-cilium-state-267395c65e4f15544378f0c4d6c234aa9019c2890a99b0e2746b2b6630a1ff1a.log <==

==> cilium-2kgjb_kube-system_config-fbc2ab79bf0bd2878ad7df6291d7cfa2b282d2fcd3905ace9961cb13a7683f54.log <==
2024-10-18T03:04:56.840115888Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Establishing connection to apiserver" host="https://10.43.0.1:443" subsys=k8s-client
2024-10-18T03:04:56.861409014Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Connected to apiserver" subsys=k8s-client
2024-10-18T03:04:56.865321631Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Reading configuration from config-map:kube-system/cilium-config" configSource="config-map:kube-system/cilium-config" subsys=option-resolver
2024-10-18T03:04:56.87501038Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Got 139 config pairs from source" configSource="config-map:kube-system/cilium-config" subsys=option-resolver
2024-10-18T03:04:56.87506166Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Reading configuration from cilium-node-config:kube-system/" configSource="cilium-node-config:kube-system/" subsys=option-resolver
2024-10-18T03:04:56.895834868Z stderr F W1018 03:04:56.895396       1 warnings.go:70] cilium.io/v2alpha1 CiliumNodeConfig will be deprecated in cilium v1.16; use cilium.io/v2 CiliumNodeConfig
2024-10-18T03:04:56.896202622Z stderr F time="2024-10-18T03:04:56Z" level=info msg="Got 0 config pairs from source" configSource="cilium-node-config:kube-system/" subsys=option-resolver
2024-10-18T03:04:56.944118135Z stderr F 2024/10/18 03:04:56 INFO Started duration=103.87983ms
2024-10-18T03:04:56.945060402Z stderr F 2024/10/18 03:04:56 INFO Stopping
2024-10-18T03:04:56.945094104Z stderr F 2024/10/18 03:04:56 INFO health.job-module-status-metrics (rev=2) module=health
==> cilium-2kgjb_kube-system_install-cni-binaries-73557773392eec7e0a007aee6a7a405f7e919c487ab41309c4436bfec1e2d1a0.log <==
2024-10-18T03:05:06.802529876Z stdout F Installing loopback to /host/opt/cni/bin/loopback ...
2024-10-18T03:05:06.83234181Z stdout F Wrote /host/opt/cni/bin/loopback
2024-10-18T03:05:06.838995105Z stdout F Installing cilium-cni to /host/opt/cni/bin/cilium-cni ...
2024-10-18T03:05:07.130704627Z stdout F Wrote /host/opt/cni/bin/cilium-cni

==> cilium-2kgjb_kube-system_install-portmap-cni-plugin-1099be9951a529c80dc2b35de27660038a1206f8b107bc04ee13731f06c7a869.log <==
2024-10-18T03:04:28.054528923Z stdout F loopback is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.058854779Z stdout F macvlan is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.084236646Z stdout F copied /opt/cni/bin/portmap to /host/opt/cni/bin correctly
2024-10-18T03:04:28.088001364Z stdout F ptp is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.091797496Z stdout F sbr is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.096527088Z stdout F static is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.118189342Z stdout F copied /opt/cni/bin/tap to /host/opt/cni/bin correctly
2024-10-18T03:04:28.12302963Z stdout F tuning is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.12748478Z stdout F vlan is in SKIP_CNI_BINARIES, skipping
2024-10-18T03:04:28.131794167Z stdout F vrf is in SKIP_CNI_BINARIES, skipping

==> cilium-2kgjb_kube-system_mount-bpf-fs-d4bd6411ff67a6968b080c0e4be9ca562c1f14e091b3d2bfdd22e9eab1068261.log <==
2024-10-18T03:05:04.524331438Z stdout F bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)

==> cilium-2kgjb_kube-system_mount-cgroup-7161654cca4edb73efda9a69dd5341d667e16448570bd599e71e4af0b6931144.log <==
2024-10-18T03:05:02.793917307Z stderr F time="2024-10-18T03:05:02Z" level=info msg="Mounted cgroupv2 filesystem at /run/cilium/cgroupv2" subsys=cgroups

==> kube-proxy-prod-workers-dlckv-vbm8j_kube-system_kube-proxy-9134d4449c11622d196ce5536b03dee26c565631abba4bb2bbf403771c7da810.log <==
2024-10-18T03:04:46.996851096Z stderr F I1018 03:04:46.996748       1 server.go:874] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
2024-10-18T03:04:46.998750378Z stderr F I1018 03:04:46.998475       1 config.go:101] "Starting endpoint slice config controller"
2024-10-18T03:04:46.998783916Z stderr F I1018 03:04:46.998487       1 config.go:192] "Starting service config controller"
2024-10-18T03:04:46.998793619Z stderr F I1018 03:04:46.998504       1 config.go:319] "Starting node config controller"
2024-10-18T03:04:46.998802524Z stderr F I1018 03:04:46.998518       1 shared_informer.go:313] Waiting for caches to sync for endpoint slice config
2024-10-18T03:04:46.998811421Z stderr F I1018 03:04:46.998523       1 shared_informer.go:313] Waiting for caches to sync for service config
2024-10-18T03:04:46.998819964Z stderr F I1018 03:04:46.998527       1 shared_informer.go:313] Waiting for caches to sync for node config
2024-10-18T03:04:47.101770411Z stderr F I1018 03:04:47.099535       1 shared_informer.go:320] Caches are synced for endpoint slice config
2024-10-18T03:04:47.101799013Z stderr F I1018 03:04:47.099617       1 shared_informer.go:320] Caches are synced for node config
2024-10-18T03:04:47.101808444Z stderr F I1018 03:04:47.099654       1 shared_informer.go:320] Caches are synced for service config

==> pushprox-kube-proxy-client-v4sx6_cattle-monitoring-system_pushprox-client-8c700c0ebaa337fc81f4eff85f6cc89e0e1b9fc42411619055e35e43f87309d3.log <==
2024-10-18T03:05:03.693417448Z stderr F ts=2024-10-18T03:05:03.692Z caller=main.go:269 level=info msg="URL and FQDN info" proxy_url=http://pushprox-kube-proxy-proxy.cattle-monitoring-system.svc:8080/ fqdn=192.168.10.211
2024-10-18T03:05:54.61196902Z stderr F ts=2024-10-18T03:05:54.611Z caller=main.go:232 level=info msg="Got scrape request" scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 url=http://192.168.10.211:10249/metrics
2024-10-18T03:05:54.629214776Z stderr F ts=2024-10-18T03:05:54.627Z caller=main.go:167 level=info scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 msg="Retrieved scrape response"
2024-10-18T03:05:54.640633631Z stderr F ts=2024-10-18T03:05:54.640Z caller=main.go:173 level=info scrape_id=71ba4d5b-9259-4ff5-abc1-174fe137fd17 msg="Pushed scrape result"

==> rancher-monitoring-prometheus-node-exporter-zz488_cattle-monitoring-system_node-exporter-90596f4fd33069f83beaa9fcffcd1caa110ee962d904f2f53daba029cc848337.log <==
2024-10-18T03:04:24.413827964Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=thermal_zone
2024-10-18T03:04:24.413883731Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=time
2024-10-18T03:04:24.413897931Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=timex
2024-10-18T03:04:24.413909718Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=udp_queues
2024-10-18T03:04:24.413921251Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=uname
2024-10-18T03:04:24.413932868Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=vmstat
2024-10-18T03:04:24.413944448Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=xfs
2024-10-18T03:04:24.413956021Z stderr F ts=2024-10-18T03:04:24.413Z caller=node_exporter.go:117 level=info collector=zfs
2024-10-18T03:04:24.414928911Z stderr F ts=2024-10-18T03:04:24.414Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9796
2024-10-18T03:04:24.414960089Z stderr F ts=2024-10-18T03:04:24.414Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9796

==> cilium-2kgjb_kube-system_cilium-agent-03e82a96844e41a904e6729db3626568b1307635f980ce977f5d0a8eafd7f92c.log <==
2024-10-18T03:06:10.763798752Z stderr F time="2024-10-18T03:06:10Z" level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
2024-10-18T03:06:13.430771964Z stderr F time="2024-10-18T03:06:13Z" level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
2024-10-18T03:06:14.430380732Z stderr F time="2024-10-18T03:06:14Z" level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
/var/lib/rancher/rke2/bin/crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD
03e82a96844e4       0fba21db2995c       3 minutes ago       Running             cilium-agent        0                   8d91bff29b0fd       cilium-2kgjb
9134d4449c116       074a06fa52dfe       3 minutes ago       Running             kube-proxy          0                   39d28b02b1512       kube-proxy-prod-workers-dlckv-vbm8j
90596f4fd3306       72c9c20889862       4 minutes ago       Running             node-exporter       0                   8c23479a14510       rancher-monitoring-prometheus-node-exporter-zz488
8c700c0ebaa33       af3e37f58a5fe       4 minutes ago       Running             pushprox-client     0                   00782d135b523       pushprox-kube-proxy-client-v4sx6
journalctl -u rke2-agent
has no errors.
oo found one!
Oct 18 03:04:16 prod-workers-dlckv-vbm8j rke2[3312]: time="2024-10-18T03:04:16Z" level=error msg="Error encountered while importing /var/lib/rancher/rke2/agent/images/runtime-image.txt: failed to pull images from /var/lib/rancher/rke2/agent/images/runtime-image.txt: im>
c
we have never included etcdctl as a standalone binary with RKE2 in the same way we do kubectl or crictl. Were you really finding it inside one of the container images in the containerd image store and running it from the host!? That only works if the binary in the image is statically linked or linked against shared libraries that are available on your node. Which is NOT a safe assumption.
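(For what it's worth, a way to run etcdctl on an RKE2 server without copying a binary onto the host is to exec it inside the etcd static pod. A sketch, assuming the stock RKE2 pod name etcd-<nodename> and that the usual host cert paths are mounted into that pod:)

/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  -n kube-system exec etcd-$(hostname) -- \
  etcdctl --endpoints https://127.0.0.1:2379 \
    --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
    --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
    --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
    member list -w table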
what are you trying to do with that anyway
w
so, whenever nodes fail to start and churn through creates and deletes, in the past it's been etcd like 99% of the time. like ghost nodes that don't exist but are still in membership
etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379/ --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt member list
to check. if there are 3 nodes in rancher and >3 in etcd, then in the past i have had to remove the bad entry
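If the member list really does show a stale entry, removing it looks roughly like this (a sketch; the ectl wrapper is just shorthand for the same certs as above, and <MEMBER_ID> is the hex id from the first column of member list):

ectl() {
  etcdctl --endpoints https://127.0.0.1:2379 \
    --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
    --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
    --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key "$@"
}
ectl member list -w table       # note the hex ID of the ghost member
ectl member remove <MEMBER_ID>  # remove only after double-checking the ID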
anyway, i used to be able to ssh from rancher UI and run that
my real problem is that nodes can't be created.
also, it's not etcd 😄
rke2-agent
Oct 18 03:16:53 prod-workers-dd47cfcd4xc6c85-jqvt4 rke2[3170]: time="2024-10-18T03:16:53Z" level=info msg="rke2 agent is up and running"
Oct 18 03:16:53 prod-workers-dd47cfcd4xc6c85-jqvt4 systemd[1]: Started Rancher Kubernetes Engine v2 (agent).
i get there.
orting /var/lib/rancher/rke2/agent/images/runtime-image.txt: failed to pull images from /var/lib/rancher/rke2/agent/images/runtime-image.txt: image \"index.docker.io/rancher/rke2-runtime:v1.30.4-rke2r1\": not found"
hmm I can pull it just fine?
2024-10-18T03:29:12.012612806Z stderr F ts=2024-10-18T03:29:12.012Z caller=main.go:222 level=error msg="Error polling:" err="Post \"http://pushprox-kube-proxy-proxy.cattle-monitoring-system.svc:8080/poll\": dial tcp: lookup pushprox-kube-proxy-proxy.cattle-monitoring-sy
pretty sure you're the only committer Brandon
cilium shows
CrashLoopBackOff (back-off 40s restarting failed container=config pod=cilium-g8d74_kube-system(f3a60b8a-6185-472e-854b-87b90dcaf304)) | Last state: Terminated with 1: Error (Running
2024/10/18 04:02:29 INFO Starting
time="2024-10-18T04:02:29Z" level=info msg="Establishing connection to apiserver" host="https://10.43.0.1:443" subsys=k8s-client
time="2024-10-18T04:03:04Z" level=info msg="Establishing connection to apiserver" host="https://10.43.0.1:443" subsys=k8s-client
time="2024-10-18T04:03:34Z" level=error msg="Unable to contact k8s api-server" error="Get \"https://10.43.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.43.0.1:443: i/o timeout" ipAddr="https://10.43.0.1:443" subsys=k8s-client
2024/10/18 04:03:34 ERROR Start hook failed function="client.(*compositeClientset).onStart (k8s-client)" error="Get \"https://10.43.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.43.0.1:443: i/o timeout"
2024/10/18 04:03:34 ERROR Start failed error="Get \"https://10.43.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.43.0.1:443: i/o timeout" duration=1m5.011458077s
2024/10/18 04:03:34 INFO Stopping
Error: Build config failed: failed to start: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system": dial tcp 10.43.0.1:443: i/o timeout
), started: Fri, Oct 18 2024 12:02:29 am, finished: Fri, Oct 18 2024 12:03:34 am
c
You can ignore those cannot pull errors from the .txt files. They're bogus. I thought I'd addressed it but apparently not.
Is kube-proxy running on these nodes? Are you using the cilium kube proxy replacement?
w
let me check
yes kube-proxy is running
c
That’s what handles programming iptables on the node to manage connectivity between pods and cluster service endpoints, including the in-cluster apiserver endpoint at 10.43.0.1:443.
if kube-proxy is running, and the apiserver is up, then I’d check connectivity between the node in question and the apiserver nodes.
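A few quick checks from the affected node, as a sketch (substitute a real server node IP for <server-ip>; ipvsadm may need installing):

# can the node reach an apiserver directly?
curl -vk https://<server-ip>:6443/healthz
# is the in-cluster service VIP actually programmed on this node?
ipvsadm -Ln | grep -A2 '10.43.0.1'          # if kube-proxy is in ipvs mode
iptables-save -t nat | grep '10.43.0.1'     # if it's in iptables mode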
w
so, everything is the same aside from the updates, and all nodes that existed pre-upgrade are fine
That is my cilium config
i upgraded through each version of rancher one at a time, then k8s one at a time, per versioning policy, ensuring i was within min/max versions for each (via the matrix)
I diffed the cilium config each time as it can change in rancher ui. I only brought forward things like
null
=>
[]
or
{}
and of course image tags etc and new config
problem seems to be cilium failing to start
@clever-dusk-66120 kube-proxy is not running. i have kubelet push-prox
root@prod-workers-dd47cfcd4xc6c85-s6x4g:/var/log/containers# cat kube-proxy-prod-workers-dd47cfcd4xc6c85-s6x4g_kube-system_kube-proxy-95407588e6f3ca6fc13a7001c899a472f5a7ce53f0a86386b1831a0126ca89d2.log
2024-10-22T22:06:18.89262595Z stderr F I1022 22:06:18.892024       1 server.go:1062] "Successfully retrieved node IP(s)" IPs=["192.168.11.102"]
2024-10-22T22:06:19.004293519Z stderr F I1022 22:06:19.003852       1 server.go:659] "kube-proxy running in dual-stack mode" primary ipFamily="IPv4"
2024-10-22T22:06:19.00697528Z stderr F time="2024-10-22T22:06:19Z" level=warning msg="Running modprobe ip_vs failed with message: `modprobe: error while loading shared libraries: libzstd.so.1: cannot open shared object file: No such file or directory`, error: exit status 127"
2024-10-22T22:06:19.014651358Z stderr F time="2024-10-22T22:06:19Z" level=error msg="Could not get ipvs family information from the kernel. It is possible that ipvs is not enabled in your kernel. Native loadbalancing will not work until this is fixed."
2024-10-22T22:06:19.017635785Z stderr F I1022 22:06:19.017183       1 proxier.go:646] "Dummy VS not created" scheduler="rr"
2024-10-22T22:06:19.017676246Z stderr F E1022 22:06:19.017261       1 server.go:558] "Error running ProxyServer" err="can't use the IPVS proxier: Ipvs not supported"
2024-10-22T22:06:19.017708694Z stderr F E1022 22:06:19.017323       1 run.go:74] "command failed" err="can't use the IPVS proxier: Ipvs not supported"
dmesg | grep ipvs
is empty
this all used to work. So this is with
jammy
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"

# modprobe ipvs
modprobe: FATAL: Module ipvs not found in directory /lib/modules/5.15.0-105-generic
with noble,
[  137.441108] IPVS: Registered protocols (TCP, UDP, SCTP, AH, ESP)
[  137.441149] IPVS: Connection hash table configured (size=4096, memory=32Kbytes)
[  137.441258] IPVS: ipvs loaded.
[  137.458691] IPVS: [rr] scheduler registered.
[  137.459841] IPVS: starting estimator thread 0...
[  137.522567] IPVS: using max 3984 ests per chain, 199200 per kthread
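On a node where the modules did load, it's worth confirming kube-proxy actually programmed IPVS virtual servers; a quick check (assumes ipvsadm is installed):

lsmod | grep ip_vs      # modules loaded
ipvsadm -Ln | head      # virtual servers programmed by kube-proxy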
ok, so kube proxy is running here
Ok here we go 😄
==> apply-system-agent-upgrader-on-prod-workers-dlckv-9s5lf-w-wsrnh_cattle-system_upgrade-7192caf6072216187325ad879e63ff76eafafb5f17c73574d5e70532dc29af70.log <==
2024-10-22T22:41:10.059179477Z stderr F + CATTLE_AGENT_UNINSTALL_LOCAL=true
2024-10-22T22:41:10.05919205Z stderr F + export CATTLE_AGENT_BINARY_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2/rancher-system-agent
2024-10-22T22:41:10.059203981Z stderr F + CATTLE_AGENT_BINARY_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2/rancher-system-agent
2024-10-22T22:41:10.059223065Z stderr F + export CATTLE_AGENT_UNINSTALL_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2/rancher-system-agent-uninstall.sh
2024-10-22T22:41:10.059244439Z stderr F + CATTLE_AGENT_UNINSTALL_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2/rancher-system-agent-uninstall.sh
2024-10-22T22:41:10.059257132Z stderr F + '[' -s /host/etc/systemd/system/rancher-system-agent.env ']'
2024-10-22T22:41:10.059270263Z stderr F + chroot /host /var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2/install.sh
2024-10-22T22:41:10.084350706Z stderr F [FATAL]  You must select at least one role.
2024-10-22T22:41:10.08605501Z stderr F + cleanup
2024-10-22T22:41:10.086088531Z stderr F + rm -rf /host/var/lib/rancher/agent/tmp/tmp.FsJlsYW9Q2
c
did you manually put it in ipvs mode? RKE2 runs kube-proxy in iptables mode by default. If you switched it to ipvs mode, but then don’t have the correct kernel modules on your nodes, that would be a problem.
w
no I did not. I didn't change my cilium/kube proxy config. So apparently, the kernel in jammy is too old for the newer rke2/k8s? Unsure. But the issue was the upgrade plan. I for some reason was using a k3s upgrade plan. Probably was late and i blindly copied and pasted and didn't modify 😞
i deleted the plans and it looks like i progressed
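For anyone hitting the same thing: if you drive upgrades yourself with the system-upgrade-controller, an RKE2 server Plan looks roughly like this (version and selector are illustrative); the k3s variant points at rancher/k3s-upgrade and a +k3s version string, which an RKE2 node can't run:

kubectl apply -f - <<'EOF'
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: rke2-server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"]}
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.30.4+rke2r1
EOF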
c
> 2024-10-22T22:06:19.017676246Z stderr F E1022 22:06:19.017261       1 server.go:558] "Error running ProxyServer" err="can't use the IPVS proxier: Ipvs not supported"
> 2024-10-22T22:06:19.017708694Z stderr F E1022 22:06:19.017323       1 run.go:74] "command failed" err="can't use the IPVS proxier: Ipvs not supported"
Someone has intentionally put kube-proxy in IPVS mode. It does not use ipvs by default. Check your kube-proxy-args in the RKE2 config.
w
in cilium values.yaml?
c
idk, is that from the kube-proxy pod log, or a cilium pod?
w
that was kube proxy
c
ok, so that’d be in /etc/rancher/rke2/config.yaml or a file under /etc/rancher/rke2/config.yaml.d/
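A quick way to see where that flag is actually coming from on a node (paths as above; a sketch):

grep -rn 'kube-proxy-arg\|proxy-mode' /etc/rancher/rke2/config.yaml /etc/rancher/rke2/config.yaml.d/ 2>/dev/null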
w
machineGlobalConfig:
      cni: cilium
      disable-kube-proxy: false
      etcd-expose-metrics: false
      kube-proxy-arg:
        - proxy-mode=ipvs
        - ipvs-strict-arp=true
      kube-proxy-extra-mount:
        - /lib/modules:/lib/modules:ro
c
welp, like I said you put it in ipvs mode
w
it's been like that for a few years
can ipvs work with kube proxy? or only when no kube proxy?
c
apparently in previous years your nodes had the correct ipvs kernel modules loaded
it can, if you load the kernel modules
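If the plan is to stay on IPVS mode, the node image needs the modules present and loaded on every node; a minimal sketch for an Ubuntu cloud image (the module list is the usual set kube-proxy's IPVS proxier wants):

apt-get install -y ipvsadm ipset
cat >/etc/modules-load.d/ipvs.conf <<'EOF'
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
EOF
systemctl restart systemd-modules-load.service
lsmod | grep ip_vs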
w
yeah, i'm saying i haven't touched my images, i use the raw ubuntu cloud images. The upgrade controller plans were wrong, which broke stuff. But now i've deleted the plans and things seem to be working, except the node is not ready. let me check rke2 logs
journalctl -u rke2-server
looks fine
1 handler_proxy.go:93] no RequestInfo found in the context
2024-10-22T23:31:22.666351775Z stderr F E1022 23:31:22.666169       1 controller.go:146] Error updating APIService "v1.packages.operators.coreos.com" with err: failed to download v1.packages.operators.coreos.com: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
2024-10-22T23:31:22.666365696Z stderr F , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
2024-10-22T23:31:22.690183818Z stderr F E1022 23:31:22.689489       1 available_controller.go:460] v1.packages.operators.coreos.c
yeah
2024-10-22T23:33:05.509294141Z stderr F E1022 23:33:05.508966       1 handler_proxy.go:137] error resolving kasten-io/aggregatedapis-svc: no endpoints available for service "aggregatedapis-svc"
2024-10-22T23:33:05.509362827Z stderr F E1022 23:33:05.509141       1 handler_proxy.go:137] error resolving kube-system/rke2-metrics-server: no endpoints available for service "rke2-metrics-server"
2024-10-22T23:33:05.509373402Z stderr F E1022 23:33:05.508990       1 handler_proxy.go:137] error resolving cattle-monitoring-system/rancher-monitoring-prometheus-adapter: no endpoints available for service "rancher-monitoring-prometheus-adapter"
2024-10-22T23:33:05.509409316Z stderr F E1022 23:33:05.509226       1 handler_proxy.go:137] error resolving kasten-io/aggregatedapis-svc: no endpoints available for service "aggregatedapis-svc"
2024-10-22T23:33:05.509430565Z stderr F E1022 23:33:05.509307       1 handler_proxy.go:137] error resolving kasten-io/aggregatedapis-svc: no endpoints available for service "aggregatedapis-svc"
2024-10-22T23:33:05.509528621Z stderr F E1022 23:33:05.509398       1 handler_proxy.go:137] error resolving kasten-io/aggregatedapis-svc: no endpoints available for service "aggregatedapis-svc"
something is up with the api server. this is an api-server-only node
c
There is a CRD (v1.packages.operators.coreos.com) that has api aggregation enabled, and it can’t connect to the backing service to update the openapi spec. Are the kasten pods up?
w
no.
c
well that would be why there are no endpoints
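One way to see which aggregated APIs are unhappy and what backs them (a sketch; the kasten-io names come from the log above):

kubectl get apiservices | grep -v True                 # anything not Available
kubectl get apiservice v1.packages.operators.coreos.com -o jsonpath='{.spec.service}{"\n"}'
kubectl -n kasten-io get pods,endpoints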
w
this is an existing cluster.
c
that doesn’t mean the apiserver isn’t up, it just means it's trying to update some stuff and it can't
w
ahh ok
so what marks the node as ready in rancher?
c
the node or the cluster?
w
node. See second image above
k8s says node is ready, but rancher is waiting
c
is rancher-system-agent service running on the node? is the cattle-cluster-agent pod running in the cluster? are there errors in the logs for either?
start checking rancher component logs
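Concretely, those checks look something like this (a sketch; the label is assumed to match the stock cattle-cluster-agent deployment):

# on the node
systemctl status rancher-system-agent
journalctl -u rancher-system-agent -n 100 --no-pager
# in the cluster
kubectl -n cattle-system get pods -l app=cattle-cluster-agent
kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=50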
so that's a negative i guess
ok cattle cluster agent is running
it's applying crds
ok i think i see what's going on. on the old nodes from before the rke2/rancher/k8s upgrade, cilium is broken because ipvs suddenly stopped working. so i guess spin up new nodes and go from there
ok, next up from kube-sys
Failed calling webhook, failing closed rancher.cattle.io.clusters.management.cattle.io: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": no endpoints available for service "rancher-webhook"
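That usually just means the rancher-webhook pod isn't up or hasn't registered endpoints yet; to check, roughly (label assumed to be the chart default app=rancher-webhook):

kubectl -n cattle-system get deploy,pods,endpoints -l app=rancher-webhook
kubectl -n cattle-system logs deploy/rancher-webhook --tail=50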
yeah so it's totally RKE2/rancher here.
Using the exact same vm image (jammy cloud image) i used pre-upgrade of rke2/rancher, plus a restore of the etcd + cluster + k8s-version backup from before the rke2/rancher upgrade, and it just sits there
it's not even getting to the point of starting containers now. so this isn't a cilium thing i think
as i can make new clusters with the defaults (+vsphere cspi) and the new noble ubuntu image