# general
m
Hello all, I have a fresh installation of Rancher for dev/testing. Rancher is installed on K3s via Helm chart, all is fine, and the UI is accessible and configured. When I try to provision a new cluster using RKE2 in vSphere, the VM is created, it gets assigned an IPv4 address, it establishes an initial connection to the Rancher cluster, and it begins the installation process. However, the machine/cluster in the UI is stuck in "waiting for cluster agent to connect". The log file for the rancher-system-agent says:
May 29 10:17:38 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:38Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: open /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: no such file or directory"
May 29 10:17:38 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:38Z" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
May 29 10:17:38 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:38Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory"
May 29 10:17:38 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:38Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
May 29 10:17:43 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:43Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: open /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: no such file or directory"
May 29 10:17:43 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:43Z" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
May 29 10:17:43 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:43Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory"
May 29 10:17:43 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:43Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
May 29 10:20:22 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:20:22Z" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (19332 vs 19349)"
May 29 10:20:22 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:20:22Z" level=error msg="error syncing 'fleet-default/test-cluster-bootstrap-template-dlg6f-machine-plan': handler secret-watch: secret received was too old, requeuing"
May 29 10:26:10 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:26:10Z" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240529-102610/bcebc769171ad0d331ab3e189cbaa4be1c7dc417ec0e23f1be95a7d8dbb2363a_0"
May 29 10:26:10 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:26:10Z" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
May 29 10:26:10 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:26:10Z" level=info msg="[bcebc769171ad0d331ab3e189cbaa4be1c7dc417ec0e23f1be95a7d8dbb2363a_0:stdout]: Name Location Size Created"
May 29 10:26:10 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:26:10Z" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
p
I see no actual error out there.
From what I'm seeing there are some timezone issues between your K3s cluster and your node, but apart from that this is a regular log from a starting RKE2 node. Check the rke2-server service for better info if your node still doesn't show up after a few minutes.
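Something like this on the node should surface the real failure, assuming a default RKE2 server install under systemd (the kubelet log path is the stock RKE2 location):
# Service state and live logs for the RKE2 server
systemctl status rke2-server
journalctl -u rke2-server -f
# Kubelet logs live here on a stock RKE2 install
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log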
m
I think I found the issue: the VM template has too little disk space available, and / is at 100%. I will update the thread if this was the root cause.
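For anyone hitting the same thing, this is trivially confirmed on the node with standard tooling:
# / at or near 100% is the smoking gun
df -h /
# Find what is eating the space
du -xh --max-depth=1 / 2>/dev/null | sort -h | tail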
p
Oh yeah, and even with some space left, kube would've registered disk pressure on the node.
It wants the node to be at most 80 or 90% full, otherwise it won't schedule
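For reference, the kubelet's default hard eviction thresholds are nodefs.available < 10% and imagefs.available < 15%, which matches the ~90% figure. Once the node registers, its conditions show this directly (node name taken from the logs above):
# DiskPressure=True means the kubelet hit an eviction threshold
kubectl describe node test-cluster-pool1-8cedc659-9z2t4 | grep -A 8 'Conditions:'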
m
Alright, so I managed to amend the template, and now the rancher-system-agent logs show the following:
May 29 13:43:56 test-pool1-9ae8550e-r2g9f rancher-system-agent[2728]: time="2024-05-29T13:43:56Z" level=info msg="Rancher System Agent version v0.2.13 (4fa9427) is starting"
May 29 13:43:56 test-pool1-9ae8550e-r2g9f rancher-system-agent[2728]: time="2024-05-29T13:43:56Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
May 29 13:43:56 test-pool1-9ae8550e-r2g9f rancher-system-agent[2728]: time="2024-05-29T13:43:56Z" level=info msg="Starting remote watch of plans"
May 29 13:43:56 test-pool1-9ae8550e-r2g9f rancher-system-agent[2728]: E0529 13:43:56.632963    2728 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
May 29 13:43:56 test-pool1-9ae8550e-r2g9f rancher-system-agent[2728]: time="2024-05-29T13:43:56Z" level=info msg="Starting /v1, Kind=Secret controller"
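Since the agent is now up and watching plans, the remaining suspects are connectivity and certificates between the node and Rancher. A quick check from the node, with rancher.example.com standing in for the actual Rancher URL (/ping is Rancher's health endpoint and returns "pong"):
# -k separates TLS problems from plain network problems
curl -sk https://rancher.example.com/ping
# Repeat without -k to catch CA mismatches
curl -s https://rancher.example.com/ping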
Also, from the Rancher cluster master, the logs from the Rancher pods show:
[rke2configserver] fleet-default/test-pool1-6d5747dfd4-jmjmp machineID: 5b02a3da8ae219776381aede96686381e080630efa28beb995006c71aa9b4af delivering planSecret test-bootstrap-template-vnc5l-machine-plan with token secret fleet-default/test-bootstrap-template-vnc5l-machine-plan-token-cbx7d to system-agent
2024/05/29 13:39:33 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-ftwjxzm5: ClusterUnavailable 503: cluster not found, requeuing
2024/05/29 13:40:03 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-ftwjxzm5: ClusterUnavailable 503: cluster not found, requeuing
2024/05/29 13:40:33 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-ftwjxzm5: ClusterUnavailable 503: cluster not found, requeuing
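For what it's worth, the same provisioning state can be inspected from the Rancher local cluster via Rancher's CRDs; the cluster name test below is my reading of the machine names in the logs, so adjust to taste:
# Run with the local (Rancher) cluster kubeconfig
kubectl get clusters.provisioning.cattle.io -n fleet-default
kubectl get machines.cluster.x-k8s.io -n fleet-default
# Conditions here are usually more specific than the UI banner
kubectl describe clusters.provisioning.cattle.io test -n fleet-default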
b
I had all sorts of issues getting this going. Are you using Ubuntu? I had to make sure iptables was installed on the template; didn't see that in the docs
works well now
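Easy to verify on the template if anyone wants to rule it out:
# Confirm iptables exists on the image
command -v iptables || echo "iptables missing"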
p
Installing Docker on Debian pulls in iptables for me anyway lol
m
I am using Ubuntu, and the template does have iptables on it. I am still troubleshooting this. I am positive it is something to do with the actual VM that is provisioned, since I tested and other Rancher clusters are also unable to create a cluster on that vSphere. I will see if I can migrate a working template from another vSphere onto this one and check whether it works.
p
It may be because you are missing some kernel modules; that's not uncommon for very light VM templates with minimal setups
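The usual suspects for RKE2/containerd networking would be overlay and br_netfilter; this is my guess at a minimal check, not an official requirements list:
# Are the modules loaded, or at least loadable?
lsmod | grep -E 'overlay|br_netfilter'
sudo modprobe overlay && sudo modprobe br_netfilter && echo "modules available"
# Bridged traffic must be visible to iptables, and forwarding enabled
sysctl net.bridge.bridge-nf-call-iptables net.ipv4.ip_forward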
m
I agree. Also, the problematic vSphere is using slow mechanical disks, and read/write delays there are not uncommon and cause all sorts of issues. I will test with a template from a vSphere where clusters are being created successfully and let you guys know.
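Slow disks would hit etcd hardest. The fio test below is the fdatasync benchmark recommended in etcd's disk tuning guidance; run it on the disk backed by that datastore (it writes ~22 MiB of throwaway data into test-data/):
# etcd wants 99th percentile fdatasync latency well under 10ms
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=etcd-disk-check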