swift-farmer-84003
10/28/2025, 10:35 AMrich-thailand-55018
10/28/2025, 5:26 PMancient-dinner-76338
10/30/2025, 4:15 AMelegant-truck-75829
10/30/2025, 7:31 AMbetter-rain-46397
10/30/2025, 10:03 PMThere are security issues in the glibc used in SUSE-based Docker images such as docker.io/rancher/fleet-agent. Some of the issues are CVE-2025-4802, 2025:01702-1, and SUSE-SU-2025:0582-1. Does anyone know if the images used within Rancher Fleet will be updated?square-gold-26983
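A quick way to check whether a given fleet-agent image already ships a patched glibc is to query the package inside the image. This is only a sketch: the tag below is an example, and it assumes the SUSE BCI-based image still contains the rpm tool:
docker pull docker.io/rancher/fleet-agent:v0.12.3    # example tag, substitute the one you actually run
docker run --rm --entrypoint "" docker.io/rancher/fleet-agent:v0.12.3 rpm -q glibc
# Compare the printed glibc version-release against the fixed version listed in the SUSE advisory.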
10/31/2025, 12:22 AMwhite-notebook-25493
10/31/2025, 2:41 PMstrong-action-64019
11/03/2025, 7:18 AMbetter-telephone-21557
11/03/2025, 4:25 PMancient-dinner-76338
11/04/2025, 3:51 AMlittle-ram-70987
11/04/2025, 2:17 PMnutritious-intern-6999
11/05/2025, 12:54 PMastonishing-nail-55291
11/05/2025, 1:34 PMGetting "Logging in failed: Your account may not be authorized to log in"
Followed the documentation here: https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/authentication-permiss[…]uration/authentication-config/configure-keycloak-samlproud-secretary-84522
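That error message is fairly generic on the Rancher UI side; the underlying SAML failure usually shows up in the Rancher server logs while the login is reproduced. A hedged way to look (the label selector assumes a standard Rancher Helm chart install in cattle-system):
kubectl -n cattle-system logs -l app=rancher --tail=200 -f | grep -iE 'saml|keycloak|error'
Common culprits the log tends to point at are missing or mismatched Keycloak SAML client mappers for the UID / display name / user name / groups attributes the Rancher config form asks for, or a wrong entity/client ID, but that is a general observation rather than a diagnosis of this setup.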
11/05/2025, 5:07 PMI'm having trouble with the nvidia-device-plugin-daemonset. The nvidia-device-plugin-daemonset pod is stuck in CrashLoopBackOff.
The NVIDIA driver is correctly installed on the host, and nvidia-smi works as expected. When I run kubectl describe pod nvidia-device-plugin-daemonset-85bkr -n kube-system on the failing pod, I get this specific error message:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 40m (x1266 over 5h15m) kubelet Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-daemonset-85bkr_kube-system(3f6e3f80-6add-41bc-b49a-6e0aa8f2af30)
Normal Pulled 38m (x59 over 5h15m) kubelet Container image "nvcr.io/nvidia/k8s-device-plugin:v0.18.0" already present on machine
Warning Failed 23m (x5 over 26m) kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: expected cgroupsPath to be of format "slice:prefix:name" for systemd cgroups, got "/kubepods/besteffort/pod3f6e3f80-6add-41bc-b49a-6e0aa8f2af30/nvidia-device-plugin-ctr" instead
Normal Pulled 4m44s (x9 over 26m) kubelet Container image "nvcr.io/nvidia/k8s-device-plugin:v0.18.0" already present on machine
Normal Created 4m44s (x9 over 26m) kubelet Created container: nvidia-device-plugin-ctr
Warning BackOff 55s (x117 over 26m) kubelet Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-daemonset-85bkr_kube-system(3f6e3f80-6add-41bc-b49a-6e0aa8f2af30)
So any pod that requests the GPU is basically stuck in Pending...
kubectl get pods
NAME READY STATUS RESTARTS AGE
nvidia-gpu-test 0/1 Pending 0 5h10m
I have been following this guide here https://docs.rke2.io/advanced?_highlight=gpu#deploy-nvidia-operator
any help? thanks
And this is the config.toml.tmpl:
cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
{{ template "base" . }}
[plugins."io.containerd.cri.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
Any help?elegant-truck-75829
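The runc error itself ("expected cgroupsPath to be of format slice:prefix:name for systemd cgroups, got /kubepods/... instead") is a cgroup-driver mismatch: runc is being invoked in systemd-cgroup mode while being handed a cgroupfs-style path. Two things worth checking, offered only as a sketch and not verified against this exact RKE2 release: on RKE2 versions that ship containerd 2.x (config version 3), default_runtime_name lives under the io.containerd.cri.v1.runtime plugin rather than an io.containerd.cri.v1.cri section, and the nvidia runtime's SystemdCgroup option should match the cgroup driver the rest of the node uses. A possible template under those assumptions:
{{ template "base" . }}
# Sketch only: assumes containerd 2.x (config version 3) and a node running the systemd cgroup driver.
[plugins."io.containerd.cri.v1.runtime".containerd]
  default_runtime_name = "nvidia"
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  privileged_without_host_devices = false
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true    # set this to match whatever the default runc runtime uses on this node
Recent RKE2 releases also auto-detect nvidia-container-runtime on the host and generate the runtime entry themselves, so it may be worth testing with the custom config.toml.tmpl removed (and rke2-agent restarted) before hand-tuning it further.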
11/06/2025, 8:16 AMnutritious-intern-6999
11/06/2025, 1:38 PMshy-gold-40913
11/06/2025, 5:21 PMlevel=fatal msg="Failed to reconcile with temporary etcd: failed to normalize server token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or <PASSWORD>"
In /etc/rancher/rke2/config.yaml I've specified both agent-token and token using the output of rke2 token generate (ran it twice), which produced tokens in [a-z0-9]{6}.[a-z0-9]{16} format. What am I doing wrong here?creamy-pharmacist-50075
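Per the error, the server token must either be a plain passphrase or the full K10<CA-HASH>::<USERNAME>:<PASSWORD> string that an already-running server writes to /var/lib/rancher/rke2/server/token. A minimal config.yaml sketch with plain placeholder secrets (placeholders only, not a verified fix for this failure):
# /etc/rancher/rke2/config.yaml
token: my-shared-server-secret
agent-token: my-shared-agent-secret
If this cluster has already bootstrapped once, it is also worth confirming that the token in config.yaml still matches the one the server originally came up with; the normalized form can be copied from /var/lib/rancher/rke2/server/token on an existing server.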
11/08/2025, 10:36 AMcrooked-sunset-83417
11/09/2025, 1:16 AMnutritious-intern-6999
11/10/2025, 10:08 AMbreezy-restaurant-60331
11/11/2025, 2:10 PMadamant-kite-43734
11/12/2025, 9:45 AMbrief-vase-99095
11/12/2025, 5:14 PMshy-gold-40913
11/12/2025, 6:27 PMServer nodes:
# firewall-cmd --zone=public --list-all
public (default, active)
target: default
ingress-priority: 0
egress-priority: 0
icmp-block-inversion: no
interfaces: ens192
sources:
services: etcd-client etcd-server kube-apiserver kubelet wireguard
ports: 9345/tcp 9099/tcp 30000-32767/tcp 2381/tcp 51821/udp 8472/udp
protocols:
forward: yes
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
Agent nodes:
# firewall-cmd --zone=public --list-all
public (default, active)
target: default
ingress-priority: 0
egress-priority: 0
icmp-block-inversion: no
interfaces: ens192 ens224
sources:
services: kubelet wireguard
ports: 9099/tcp 30000-32767/tcp 8472/udp 51821/udp
protocols:
forward: yes
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
Output from overlaytest:
# ./overlaytest.sh
=> Start network overlay test
k8sagent02 can reach k8sagent02
command terminated with exit code 1
FAIL: overlaytest-4dtr4 on k8sagent02 cannot reach pod IP 10.252.2.2 on k8ssvr02
command terminated with exit code 1
FAIL: overlaytest-4dtr4 on k8sagent02 cannot reach pod IP 10.252.0.4 on k8ssvr01
command terminated with exit code 1
FAIL: overlaytest-4dtr4 on k8sagent02 cannot reach pod IP 10.252.3.19 on k8sagent01
command terminated with exit code 1
FAIL: overlaytest-4dtr4 on k8sagent02 cannot reach pod IP 10.252.1.2 on k8ssvr03
command terminated with exit code 1
FAIL: overlaytest-8vxld on k8ssvr02 cannot reach pod IP 10.252.4.3 on k8sagent02
k8ssvr02 can reach k8ssvr02
command terminated with exit code 1
FAIL: overlaytest-8vxld on k8ssvr02 cannot reach pod IP 10.252.0.4 on k8ssvr01
command terminated with exit code 1
FAIL: overlaytest-8vxld on k8ssvr02 cannot reach pod IP 10.252.3.19 on k8sagent01
command terminated with exit code 1
FAIL: overlaytest-8vxld on k8ssvr02 cannot reach pod IP 10.252.1.2 on k8ssvr03
command terminated with exit code 1
FAIL: overlaytest-ds7sh on k8ssvr01 cannot reach pod IP 10.252.4.3 on k8sagent02
command terminated with exit code 1
FAIL: overlaytest-ds7sh on k8ssvr01 cannot reach pod IP 10.252.2.2 on k8ssvr02
k8ssvr01 can reach k8ssvr01
command terminated with exit code 1
FAIL: overlaytest-ds7sh on k8ssvr01 cannot reach pod IP 10.252.3.19 on k8sagent01
command terminated with exit code 1
FAIL: overlaytest-ds7sh on k8ssvr01 cannot reach pod IP 10.252.1.2 on k8ssvr03
command terminated with exit code 1
FAIL: overlaytest-jw99g on k8sagent01 cannot reach pod IP 10.252.4.3 on k8sagent02
command terminated with exit code 1
FAIL: overlaytest-jw99g on k8sagent01 cannot reach pod IP 10.252.2.2 on k8ssvr02
command terminated with exit code 1
FAIL: overlaytest-jw99g on k8sagent01 cannot reach pod IP 10.252.0.4 on k8ssvr01
k8sagent01 can reach k8sagent01
command terminated with exit code 1
FAIL: overlaytest-jw99g on k8sagent01 cannot reach pod IP 10.252.1.2 on k8ssvr03
command terminated with exit code 1
FAIL: overlaytest-mmsv9 on k8ssvr03 cannot reach pod IP 10.252.4.3 on k8sagent02
command terminated with exit code 1
FAIL: overlaytest-mmsv9 on k8ssvr03 cannot reach pod IP 10.252.2.2 on k8ssvr02
command terminated with exit code 1
FAIL: overlaytest-mmsv9 on k8ssvr03 cannot reach pod IP 10.252.0.4 on k8ssvr01
command terminated with exit code 1
FAIL: overlaytest-mmsv9 on k8ssvr03 cannot reach pod IP 10.252.3.19 on k8sagent01
k8ssvr03 can reach k8ssvr03
=> End network overlay test
I've also excluded the various tunnel interfaces from NetworkManager per this https://docs.rke2.io/known_issues#networkmanager
# cat /etc/NetworkManager/conf.d/rke2-canal.conf
[keyfile]
unmanaged-devices=interface-name:flannel*;interface-name:cali*;interface-name:tunl*;interface-name:vxlan.calico;interface-name:vxlan-v6.calico;interface-name:wireguard.cali;interface-name:wg-v6.cali
How do I begin troubleshooting this? I'm running rke2 stable on AlmaLinux 10nutritious-petabyte-80748
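Since every cross-node check fails in both directions, a reasonable first step is to confirm whether the encapsulated overlay traffic ever reaches the other host. The interface and ports below are taken from the firewall output above (8472/udp for VXLAN, 51821/udp for WireGuard, depending on which flannel backend canal is using), and the pod IP is just one address from the test output:
# On the destination node (e.g. k8ssvr02), watch for overlay traffic:
tcpdump -ni ens192 'udp port 8472 or udp port 51821'
# From another node, ping a pod IP hosted on that destination, e.g.:
ping -c 3 10.252.2.2
# Nothing captured        -> packets are dropped in transit (firewalld/nftables, routing, or the
#                            overlay is bound to the wrong interface on the dual-homed agents).
# Captured but no replies -> look at the receiving host (nftables rules, rp_filter, CNI config).
# The agents are dual-homed (ens192 + ens224), so also confirm flannel advertised the node IP you expect:
kubectl describe node k8sagent02 | grep -i flannel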
11/13/2025, 4:31 AMblue-jelly-47972
11/13/2025, 4:59 AMmost-balloon-51259
11/13/2025, 9:48 AMkind-air-74358
11/13/2025, 10:49 AMOur fleet-agent is constantly being restarted. It may well be caused either by a Rancher update from 2.11 to 2.12 or by switching the root certificate from self-signed to a provided root certificate (following these docs).
In the fleet-controller/fleet-agentmanagement logs we constantly see the following:
time="2025-11-13T10:44:39Z" level=info msg="Deleted old agent for cluster (fleet-local/local) in namespace cattle-fleet-local-system"
time="2025-11-13T10:44:39Z" level=info msg="Cluster import for 'fleet-local/local'. Deployed new agent"
time="2025-11-13T10:45:00Z" level=info msg="Waiting for service account token key to be populated for secret cluster-fleet-local-local-1a3d67d0a899/request-cs9x7-8645b8de-5e30-4eb0-a9fe-dc96f1081856-token"
time="2025-11-13T10:45:02Z" level=info msg="Cluster registration request 'fleet-local/request-cs9x7' granted, creating cluster, request service account, registration secret"
The fleet-agent in cattle-fleet-local-system isn't reporting any errors, it just reports:
I1113 10:44:40.589439 1 leaderelection.go:257] attempting to acquire leader lease cattle-fleet-local-system/fleet-agent...
{"level":"info","ts":"2025-11-13T10:44:40Z","logger":"setup","msg":"new leader","identity":"fleet-agent-6d5f55c7d7-4pncc-1"}
I1113 10:45:00.267179 1 leaderelection.go:271] successfully acquired lease cattle-fleet-local-system/fleet-agent
{"level":"info","ts":"2025-11-13T10:45:00Z","logger":"setup","msg":"renewed leader","identity":"fleet-agent-5cf8799b4c-xn274-1"}
time="2025-11-13T10:45:00Z" level=warning msg="Cannot find fleet-agent secret, running registration"
time="2025-11-13T10:45:00Z" level=info msg="Creating clusterregistration with id 'pwvp47nf7r8pg8zfmd4tx7vxb6rhr5dwv2gcnn2m6zlrtmt54ss9kl' for new token"
time="2025-11-13T10:45:02Z" level=info msg="Waiting for secret 'cattle-fleet-clusters-system/c-9072b2e8eac3a21368e0428adc1a0244a61acd4ee571c7f88f574d905cd52' on management cluster for request 'fleet-local/request-cs9x7': secrets \"c-9072b2e8eac3a21368e0428adc1a0244a61acd4ee571c7f88f574d905cd52\" not found"
{"level":"info","ts":"2025-11-13T10:45:04Z","logger":"setup","msg":"successfully registered with upstream cluster","namespace":"cluster-fleet-local-local-1a3d67d0a899"}
{"level":"info","ts":"2025-11-13T10:45:04Z","logger":"setup","msg":"listening for changes on upstream cluster","cluster":"local","namespace":"cluster-fleet-local-local-1a3d67d0a899"}
{"level":"info","ts":"2025-11-13T10:45:04Z","logger":"setup","msg":"Starting controller","metricsAddr":":8080","probeAddr":":8081","systemNamespace":"cattle-fleet-local-system"}
{"level":"info","ts":"2025-11-13T10:45:04Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2025-11-13T10:45:04Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
{"level":"info","ts":"2025-11-13T10:45:04Z","msg":"starting server","name":"health probe","addr":"0.0.0.0:8081"}
{"level":"info","ts":"2025-11-13T10:45:04Z","logger":"controller-runtime.metrics","msg":"Serving metrics server","bindAddress":":8080","secure":false}
{"level":"info","ts":"2025-11-13T10:45:04Z","msg":"Starting EventSource","controller":"bundledeployment","controllerGroup":"fleet.cattle.io","controllerKind":"BundleDeployment","source":"kind source: *v1alpha1.BundleDeployment"}
{"level":"info","ts":"2025-11-13T10:45:04Z","logger":"setup","msg":"Starting cluster status ticker","checkin interval":"15m0s","cluster namespace":"fleet-local","cluster name":"local"}
{"level":"info","ts":"2025-11-13T10:45:04Z","msg":"Starting EventSource","controller":"drift-reconciler","source":"channel source: 0xc00078f3b0"}
{"level":"info","ts":"2025-11-13T10:45:04Z","msg":"Starting Controller","controller":"drift-reconciler"}
{"level":"info","ts":"2025-11-13T10:45:04Z","msg":"Starting workers","controller":"drift-reconciler","worker count":50}
{"level":"info","ts":"2025-11-13T10:45:04Z","msg":"Starting Controller","controller":"bundledeployment","controllerGroup":"fleet.cattle.io","controllerKind":"BundleDeployment"}
{"level":"info","ts":"2025-11-13T10:45:04Z","msg":"Starting workers","controller":"bundledeployment","controllerGroup":"fleet.cattle.io","controllerKind":"BundleDeployment","worker count":50}
{"level":"info","ts":"2025-11-13T10:45:04Z","logger":"bundledeployment.helm-deployer.install","msg":"Upgrading helm release","controller":"bundledeployment","controllerGroup":"fleet.cattle.io","controllerKind":"BundleDeployment","BundleDeployment":{"name":"fleet-agent-local","namespace":"cluster-fleet-local-local-1a3d67d0a899"},"namespace":"cluster-fleet-local-local-1a3d67d0a899","name":"fleet-agent-local","reconcileID":"1e4df644-069b-4d05-84ed-2c447bc54d15","commit":"","dryRun":false}
{"level":"info","ts":"2025-11-13T10:45:05Z","logger":"bundledeployment.deploy-bundle","msg":"Deployed bundle","controller":"bundledeployment","controllerGroup":"fleet.cattle.io","controllerKind":"BundleDeployment","BundleDeployment":{"name":"fleet-agent-local","namespace":"cluster-fleet-local-local-1a3d67d0a899"},"namespace":"cluster-fleet-local-local-1a3d67d0a899","name":"fleet-agent-local","reconcileID":"1e4df644-069b-4d05-84ed-2c447bc54d15","deploymentID":"s-2f332c47bb36e1bc8d70932ee0158e1b3289ae7ef2ea995e2bd77828ef2e9:8a42b4463e55a59ce2ccdf3c53c32455ce5fd0f601587bf57b5624b3cf8bb623","appliedDeploymentID":"s-c1fc5eeb18677acb8c4a8fd2054c2c40c4022f002ea06437f9b108731be8f:8a42b4463e55a59ce2ccdf3c53c32455ce5fd0f601587bf57b5624b3cf8bb623","release":"cattle-fleet-local-system/fleet-agent-local:20","DeploymentID":"s-2f332c47bb36e1bc8d70932ee0158e1b3289ae7ef2ea995e2bd77828ef2e9:8a42b4463e55a59ce2ccdf3c53c32455ce5fd0f601587bf57b5624b3cf8bb623"
And afterwards the fleet-agent is restarted again... Any ideas on what could be wrong and how to fix it?
We've already redeployed the fleet-controller and fleet-agent deployments and reinstalled the fleet-agent and fleet-controller Helm charts, with no luck.crooked-sunset-83417
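Given the recurring "Cannot find fleet-agent secret, running registration" line, the agent appears to re-register on every start, which after a root-CA change often means the API server URL/CA that Fleet hands to the agent no longer matches what Rancher is actually serving. A few hedged places to look; the configmap and setting names below are from memory of the Fleet and Rancher charts and may differ between versions:
# Which API server URL/CA is Fleet configured with?
kubectl -n cattle-fleet-system get configmap fleet-controller -o yaml | grep -E -A3 'apiServerURL|apiServerCA'
# Which CA certificate is Rancher currently handing out to agents?
kubectl get settings.management.cattle.io cacerts -o jsonpath='{.value}' | head
# Watch the local agent's bootstrap loop while it restarts:
kubectl -n cattle-fleet-local-system logs deploy/fleet-agent -f --tail=100
If the CA stored in the Fleet configuration still belongs to the old self-signed chain, updating it (for example via a Helm upgrade of the rancher/fleet charts with the new CA values) and letting the agent re-register is one possible direction, though that is a guess based on the certificate switch described above.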
11/13/2025, 7:12 PMhallowed-manchester-34892
11/14/2025, 3:39 AM