# rke2
i
Hi all, has anyone faced this issue while adding RKE2 cluster nodes to Rancher?

[ERROR] error syncing 'c-m-csvjq78f': handler cluster-deploy: cluster context c-m-csvjq78f is unavailable, requeuing
[INFO] [planner] rkecluster fleet-default/cluster-main: configuring bootstrap node(s) custom-6cccc7462912: waiting for cluster agent to connect
[ERROR] error syncing '_all_': handler user-controllers-controller: userControllersController: failed to set peers for key _all_: failed to start user controllers for cluster c-m-csvjq78f: ClusterUnavailable 503: cluster not found, requeuing
c
log in to one of the existing nodes and see why the cattle-cluster-agent pod isn’t connected to rancher
i
It looks like the cattle-cluster-agent is not deployed yet. I'm seeing only these in rancher-system-agent.service:

$ journalctl -u rancher-system-agent.service -f
Started rancher-system-agent.service - Rancher System Agent.
time="2025-10-02T19:40:06Z" level=info msg="Rancher System Agent version v0.3.13 (5a64be2) is starting"
time="2025-10-02T19:40:06Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
time="2025-10-02T19:40:06Z" level=info msg="Starting remote watch of plans"
time="2025-10-02T19:40:06Z" level=info msg="Starting /v1, Kind=Secret controller"

In the Rancher UI it is waiting for the cluster agent to connect. I've checked the connectivity from the master and worker nodes to the Rancher server endpoint and it looks good. I see a few other people have reported the same issue:
https://slack-archive.rancher.com/t/28536525/since-we-ve-updated-rancher-to-2-11-newly-created-clusters-u
https://slack-archive.rancher.com/t/27160825/hi-i-m-having-an-issue-when-i-deploy-rancher-using-the-docke
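For reference, a quick way to sanity-check that connectivity from a node (the hostname is a placeholder; /ping and /cacerts are standard Rancher endpoints):

```
# Replace rancher.example.com with your Rancher server URL
curl -sk https://rancher.example.com/ping     # a healthy Rancher replies "pong"
curl -sk https://rancher.example.com/cacerts  # should return Rancher's CA certificate
```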
c
cattle cluster agent, not rancher system agent
cattle cluster agent runs as a pod IN the cluster. You’d need to look on an existing server node that is already up.
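For example, from a server node that is already up, something along these lines (default RKE2 paths; adjust if your install differs):

```
# Use RKE2's bundled kubectl and admin kubeconfig on a server node
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl -n cattle-system get pods
/var/lib/rancher/rke2/bin/kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=100
```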
i
Yeah, it looks like the cluster is stuck in the bootstrap phase (waiting for the cluster agent to connect). It seems the core Kubernetes control plane isn't fully initialized, so the cattle-cluster-agent is not deployed yet.
c
You said you were trying to add nodes. Is this an existing functional cluster that you are trying to add nodes to?
Or are you just trying to bring the cluster up for the first time?
i
I’m trying to bring it up for the first time. I’m deploying rancher using docker compose and trying to add master and worker nodes to it.
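Roughly like this, a minimal sketch (the image tag, container name, and volume name are placeholders):

```
# docker-compose.yml (sketch): single-node Rancher for testing only
services:
  rancher:
    image: rancher/rancher:latest   # pin a specific version in practice
    container_name: rancher
    privileged: true                # the Rancher container requires privileged mode
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - rancher-data:/var/lib/rancher
volumes:
  rancher-data:
```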
c
Well, for starters, running Rancher in Docker isn't technically supported for anything except basically toy deployments. But it should generally work.
Did you add nodes with all roles to the cluster? Etcd, control plane, and worker?
Check the logs on the server (etcd and control-plane) nodes to see why it's not coming up.
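e.g. on the server node, the usual places to look (standard RKE2 log locations):

```
# RKE2 service logs
journalctl -u rke2-server -f
# kubelet and containerd logs (default RKE2 paths)
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log
tail -f /var/lib/rancher/rke2/agent/containerd/containerd.log
```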
i
I have two nodes: one master (etcd, control plane) and one worker.
I ran into a couple of issues at first but resolved them with the steps below.

[FATAL] Aborting system-agent installation due to requested strict CA verification with no CA checksum provided

Fix: Go to Rancher Global Settings > agent-tls-mode > change the value from Strict to System Store.

----------------------------------------

time="2025-09-30T01:57:52Z" level=fatal msg="invalid value provided for --profile flag"

Fix: Changed the security compliance profile from cis-1.23 to cis.

----------------------------------------

time="2025-09-30T02:04:46Z" level=fatal msg="invalid kernel parameter value vm.overcommit_memory=0 - expected 1\ninvalid kernel parameter value kernel.panic=-1 - expected 10\ninvalid kernel parameter value kernel.panic_on_oops=0 - expected 1\n"

Fix:
Step 1: Create a sysctl configuration file. Create a new file in /etc/sysctl.d/ (e.g., 90-rke2.conf); this is the standard way to apply permanent system tunings.

sudo vi /etc/sysctl.d/90-rke2.conf

# RKE2/kubelet required settings
vm.overcommit_memory = 1
kernel.panic = 10
kernel.panic_on_oops = 1

Step 2: Load the new configuration.

sudo sysctl -p /etc/sysctl.d/90-rke2.conf

Step 3: Restart RKE2. With the kernel parameters now matching the expected values, the RKE2 server should pass its internal checks and start successfully.

sudo systemctl restart rke2-server
sudo journalctl -u rke2-server -f

----------------------------------------

time="2025-09-30T02:11:56Z" level=fatal msg="missing required: user: unknown user etcd\nmissing required: group: unknown group etcd\n"
rke2-server.service: Main process exited, code=exited, status=1/FAILURE

Fix:
Step 1: Create the etcd group.

sudo groupadd --system etcd

Step 2: Create the etcd user.

sudo useradd --system \
  --shell /sbin/nologin \
  --comment "etcd service user" \
  --gid etcd \
  etcd

Step 3: Verify and restart.

# Verify the user and group are set up
id etcd

# Restart RKE2
sudo systemctl restart rke2-server
sudo journalctl -u rke2-server -f
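For reference, on a standalone RKE2 node the profile would live in /etc/rancher/rke2/config.yaml; a minimal sketch (illustrative only, since Rancher manages this config for provisioned clusters):

```
# /etc/rancher/rke2/config.yaml (sketch)
profile: cis   # newer RKE2 releases accept "cis"; older ones used versioned values like "cis-1.23"
```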
Now I’m stuck at (waiting for cluster agent to connect)
c
you’re trying to set this up as a hardened cluster (profile: cis) without having hardened the base os first?
i
Yes, that is correct.
c
if you’re just getting started, you might try not doing that, first?
but, if you have everything working, the rke2-server service should be running, and you should be able to use kubectl to interact with the cluster and see what pods are running or not running
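for example, assuming default RKE2 paths:

```
systemctl status rke2-server
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes
/var/lib/rancher/rke2/bin/kubectl get pods -A   # watch for pods stuck in Pending or CrashLoopBackOff
```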
i
oh got it. I’ll try that.
Thanks