# general
m
Hello all, I have a fresh installation of Rancher for dev/testing. Rancher is installed on K3s via Helm chart, all is fine, and the UI is accessible and configured. When I try to provision a new cluster using RKE2 in vSphere, the VM is created, it gets assigned an IPv4 address, it establishes an initial connection to the Rancher cluster, and it begins the installation process. However, the machine/cluster in the UI is stuck in "waiting for cluster agent to connect". The log file for the rancher-system-agent says:
May 29 10:17:38 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:38Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: open /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: no such file or directory"
May 29 10:17:38 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:38Z" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
May 29 10:17:38 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:38Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory"
May 29 10:17:38 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:38Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
May 29 10:17:43 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:43Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: open /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: no such file or directory"
May 29 10:17:43 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:43Z" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
May 29 10:17:43 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:43Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory"
May 29 10:17:43 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:17:43Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
May 29 10:20:22 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:20:22Z" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (19332 vs 19349)"
May 29 10:20:22 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:20:22Z" level=error msg="error syncing 'fleet-default/test-cluster-bootstrap-template-dlg6f-machine-plan': handler secret-watch: secret received was too old, requeuing"
May 29 10:26:10 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:26:10Z" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240529-102610/bcebc769171ad0d331ab3e189cbaa4be1c7dc417ec0e23f1be95a7d8dbb2363a_0"
May 29 10:26:10 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:26:10Z" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
May 29 10:26:10 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:26:10Z" level=info msg="[bcebc769171ad0d331ab3e189cbaa4be1c7dc417ec0e23f1be95a7d8dbb2363a_0:stdout]: Name Location Size Created"
May 29 10:26:10 test-cluster-pool1-8cedc659-9z2t4 rancher-system-agent[1389]: time="2024-05-29T10:26:10Z" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
p
I see no actual error out there.
From what I'm seeing there are some timezone issues between your K3s cluster and your node, but apart from that this is a regular log from a starting RKE2 node. Check the rke2-server service for better info if your node still doesn't show up after a few minutes.
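Something like this on the node should surface the real failure, assuming a default RKE2 server install under systemd (the kubelet log path is the stock RKE2 location):
# Service state and live logs for the RKE2 server
systemctl status rke2-server
journalctl -u rke2-server -f
# Kubelet logs live here on a stock RKE2 install
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log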
m
I think I found the issue: the VM template has too little disk space available, and / is at 100%. I will update the thread if this was the root cause.
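For anyone hitting the same thing, this is trivially confirmed on the node with standard tooling:
# / at or near 100% is the smoking gun
df -h /
# Find what is eating the space
du -xh --max-depth=1 / 2>/dev/null | sort -h | tail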
p
Oh yeah, and even with some space left, kube would've registered disk pressure on the node.
It wants the node to be at most 80 or 90% full, otherwise it won't schedule
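For reference, the kubelet's default hard eviction thresholds are nodefs.available < 10% and imagefs.available < 15%, which matches the ~90% figure. Once the node registers, its conditions show this directly (node name taken from the logs above):
# DiskPressure=True means the kubelet hit an eviction threshold
kubectl describe node test-cluster-pool1-8cedc659-9z2t4 | grep -A 8 'Conditions:'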
m
Alright, so I managed to amend the template, and now the rancher-system-agent logs show the following:
May 29 13:43:56 test-pool1-9ae8550e-r2g9f rancher-system-agent[2728]: time="2024-05-29T13:43:56Z" level=info msg="Rancher System Agent version v0.2.13 (4fa9427) is starting"
May 29 13:43:56 test-pool1-9ae8550e-r2g9f rancher-system-agent[2728]: time="2024-05-29T13:43:56Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
May 29 13:43:56 test-pool1-9ae8550e-r2g9f rancher-system-agent[2728]: time="2024-05-29T13:43:56Z" level=info msg="Starting remote watch of plans"
May 29 13:43:56 test-pool1-9ae8550e-r2g9f rancher-system-agent[2728]: E0529 13:43:56.632963    2728 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
May 29 13:43:56 test-pool1-9ae8550e-r2g9f rancher-system-agent[2728]: time="2024-05-29T13:43:56Z" level=info msg="Starting /v1, Kind=Secret controller"
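Since the agent is now up and watching plans, the remaining suspects are connectivity and certificates between the node and Rancher. A quick check from the node, with rancher.example.com standing in for the actual Rancher URL (/ping is Rancher's health endpoint and returns "pong"):
# -k separates TLS problems from plain network problems
curl -sk https://rancher.example.com/ping
# Repeat without -k to catch CA mismatches
curl -s https://rancher.example.com/ping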
Also, from the Rancher cluster master, the logs from the Rancher pods show:
[rke2configserver] fleet-default/test-pool1-6d5747dfd4-jmjmp machineID: 5b02a3da8ae219776381aede96686381e080630efa28beb995006c71aa9b4af delivering planSecret test-bootstrap-template-vnc5l-machine-plan with token secret fleet-default/test-bootstrap-template-vnc5l-machine-plan-token-cbx7d to system-agent
2024/05/29 13:39:33 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-ftwjxzm5: ClusterUnavailable 503: cluster not found, requeuing
2024/05/29 13:40:03 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-ftwjxzm5: ClusterUnavailable 503: cluster not found, requeuing
2024/05/29 13:40:33 [ERROR] error syncing '_all_': handler user-controllers-controller: failed to start user controllers for cluster c-m-ftwjxzm5: ClusterUnavailable 503: cluster not found, requeuing
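For what it's worth, the same provisioning state can be inspected from the Rancher local cluster via Rancher's CRDs; the cluster name test below is my reading of the machine names in the logs, so adjust to taste:
# Run with the local (Rancher) cluster kubeconfig
kubectl get clusters.provisioning.cattle.io -n fleet-default
kubectl get machines.cluster.x-k8s.io -n fleet-default
# Conditions here are usually more specific than the UI banner
kubectl describe clusters.provisioning.cattle.io test -n fleet-default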
b
I had all sorts of issues getting this going. Are you using Ubuntu? I had to make sure iptables was installed on the template; didn't see that in the docs
works well now
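Easy to verify on the template if anyone wants to rule it out:
# Confirm iptables exists on the image
command -v iptables || echo "iptables missing"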
p
Installing Docker on Debian pulls in iptables for me anyway lol
m
I am using Ubuntu, and the template does have iptables on it. I am still troubleshooting this. I am positive it is something to do with the actual VM that is provisioned, since I tested and other Rancher clusters are also unable to create a cluster on that vSphere. I will see if I can migrate a working template from another vSphere onto this one and check whether it works.
p
It may be because you are missing some kernel modules; that's not uncommon for very light VM templates with minimal setups
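The usual suspects for RKE2/containerd networking would be overlay and br_netfilter; this is my guess at a minimal check, not an official requirements list:
# Are the modules loaded, or at least loadable?
lsmod | grep -E 'overlay|br_netfilter'
sudo modprobe overlay && sudo modprobe br_netfilter && echo "modules available"
# Bridged traffic must be visible to iptables, and forwarding enabled
sysctl net.bridge.bridge-nf-call-iptables net.ipv4.ip_forward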
m
I agree. Also, the problematic vSphere is using slow mechanical disks, and read/write delays there are not uncommon and cause all sorts of issues. I will test with a template from a vSphere where clusters are being created successfully and let you guys know.
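Slow disks would hit etcd hardest. The fio test below is the fdatasync benchmark recommended in etcd's disk tuning guidance; run it on the disk backed by that datastore (it writes ~22 MiB of throwaway data into test-data/):
# etcd wants 99th percentile fdatasync latency well under 10ms
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=etcd-disk-check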