# general
l
OK, the first problem (the node not being provisioned at all) occurred because containerd was missing on my nodes. I was under the impression that the rancher-system-agent would install all requirements automatically (as kubespray does).
However, the second problem remains: after creating a new RKE2 cluster in my Docker installation of Rancher, provisioning is stuck at
```
2023/08/13 11:38:21 [INFO] [planner] rkecluster fleet-default/innos: waiting: non-ready bootstrap machine(s) custom-1d2b43432a95 and join url to be available on bootstrap node
```
The cattle-cluster-agent output only shows
```
time="2023-08-13T11:28:30Z" level=info msg="Starting /v1, Kind=Node controller"
time="2023-08-13T11:28:30Z" level=info msg="Starting catalog.cattle.io/v1, Kind=App controller"
time="2023-08-13T11:28:30Z" level=info msg="Starting catalog.cattle.io/v1, Kind=Operation controller"
time="2023-08-13T11:28:30Z" level=info msg="Starting apps/v1, Kind=Deployment controller"
time="2023-08-13T11:28:30Z" level=info msg="Starting /v1, Kind=ReplicationController controller"
```
and all pods are healthy:
```
NAMESPACE       NAME                                                    READY   STATUS      RESTARTS      AGE
cattle-system   cattle-cluster-agent-6879944f84-tnckz                   1/1     Running     0             17m
cattle-system   helm-operation-ft4f9                                    0/2     Completed   0             17m
cattle-system   rancher-webhook-74c9bd4d6-td9vr                         1/1     Running     0             16m
kube-system     cilium-m45hm                                            1/1     Running     0             18m
kube-system     cilium-operator-fdb5c85f8-mvh2l                         1/1     Running     0             18m
kube-system     cilium-operator-fdb5c85f8-rxw5c                         0/1     Pending     0             18m
kube-system     cloud-controller-manager-rancher-node-1                 1/1     Running     1 (18m ago)   18m
kube-system     etcd-rancher-node-1                                     1/1     Running     0             17m
kube-system     helm-install-rke2-cilium-qbpr6                          0/1     Completed   0             18m
kube-system     helm-install-rke2-coredns-v9lh5                         0/1     Completed   0             18m
kube-system     helm-install-rke2-metrics-server-km499                  0/1     Completed   0             18m
kube-system     helm-install-rke2-snapshot-controller-crd-vz7h6         0/1     Completed   0             18m
kube-system     helm-install-rke2-snapshot-controller-l6mf6             0/1     Completed   1             18m
kube-system     helm-install-rke2-snapshot-validation-webhook-tg7b9     0/1     Completed   0             18m
kube-system     kube-apiserver-rancher-node-1                           1/1     Running     0             18m
kube-system     kube-controller-manager-rancher-node-1                  1/1     Running     0             18m
kube-system     kube-proxy-rancher-node-1                               1/1     Running     0             18m
kube-system     kube-scheduler-rancher-node-1                           1/1     Running     0             18m
kube-system     rke2-coredns-rke2-coredns-7c98b7488c-pxkht              1/1     Running     0             18m
kube-system     rke2-coredns-rke2-coredns-autoscaler-65b5bfc754-8q5w6   1/1     Running     0             18m
kube-system     rke2-metrics-server-5bf59cdccb-pgg9p                    1/1     Running     0             17m
kube-system     rke2-snapshot-controller-6f7bbb497d-j2wlf               1/1     Running     0             17m
kube-system     rke2-snapshot-validation-webhook-5c499b5cdd-fnr87       1/1     Running     0             17m
```
Now node provisioning has stopped working again, even after installing containerd manually. Something is not right in my setup 🤔
m
> non-ready bootstrap machine(s) custom-1d2b43432a95 and join url to be available on bootstrap node
Are you doing this as an HA setup or a single node? And I assume these are running on different physical or virtual machines, right? The bootstrap node is the first node registered for the cluster, so if that node can't call into Rancher due to a firewall or network routing, this can happen. It can also happen if your nodes don't have fixed IPs: if the initial bootstrap node got registered but then relaunched with a different IP (you may want to set a DHCP reservation for the RKE2 nodes or use static IPs), it will try to join a cluster that might no longer exist instead of bootstrapping it.
In the latter case, the easiest way would probably be to delete the rke2 cluster and re-provision it. Alternatively, you could try to bootstrap it manually and then have Rancher adopt it once it's up.
You do not need (and probably should not) have containerd installed on the nodes first. The RKE2 installer sets up and runs its own containerd for you.
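Not from the thread, but a quick way to rule out the firewall/routing cause described above is to probe the relevant ports from the joining node. A minimal sketch, assuming the Rancher address from this thread (192.168.178.34:8443) and the usual RKE2 ports on the bootstrap node (9345 supervisor, 6443 kube-apiserver); `check_port` is a hypothetical helper, and the example hosts are placeholders:

```shell
#!/usr/bin/env bash
# Sketch: verify a prospective RKE2 node can reach Rancher and the bootstrap
# node before registering. Hosts and ports are assumptions -- adjust to your setup.

check_port() {                       # prints "open" or "closed" for host:port
  local host=$1 port=$2
  if timeout 2 bash -c ": </dev/tcp/${host}/${port}" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

# Example invocations (substitute your addresses):
# check_port 192.168.178.34 8443          # Rancher UI/API
# check_port <bootstrap-node-ip> 9345     # RKE2 supervisor (join) port
# check_port <bootstrap-node-ip> 6443     # kube-apiserver
# curl -sk https://192.168.178.34:8443/ping   # Rancher health endpoint
```

If any of these report closed from a node that should join, that points at the network rather than at the provisioning logic.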
l
> delete the rke2 cluster and re-provision
Thanks, I have done that multiple times now and the results are mixed. Registering the cluster worked once or twice, but it was never repeatable.
> The rke2 installer sets up and runs containerd for you.
OK, thanks, good to know that this will not fix my problem and I have to look elsewhere.
> if that node can't call into Rancher due to a firewall or network routing, this can happen. It can also happen if you don't have set IPs
Thanks for the advice. At the moment I do not have any firewalls in my local network in place (ufw is disabled on the Ubuntu instances) and all nodes use fixed IP addresses.
The cluster setup is very basic; I only opted for Cilium instead of Calico. However, currently nothing happens at all on the nodes after launching the bootstrap command: the rancher-system-agent is started, but that is it.
(Cluster YAML attached)
The situation is the same when running the bootstrap on an openSUSE Tumbleweed installation instead of Ubuntu 22.04.
```
curl --insecure -fL https://192.168.178.34:8443/system-agent-install.sh | sudo sh -s - --server https://192.168.178.34:8443 --label 'cattle.io/os=linux' --token xxx --ca-checksum a0b9c3b2b771127b4b8f3bcf0fb81d1fd535a22f63aea52023d7a3c3f2444c70 --etcd --controlplane
```
The agent gets installed and started, but nothing happens. The last agent logs are:
```
sudo journalctl -f -u rancher-system-agent
Aug 15 10:50:25 localhost.localdomain systemd[1]: Started Rancher System Agent.
Aug 15 10:50:25 localhost.localdomain rancher-system-agent[1913]: time="2023-08-15T10:50:25+02:00" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Aug 15 10:50:25 localhost.localdomain rancher-system-agent[1913]: time="2023-08-15T10:50:25+02:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Aug 15 10:50:25 localhost.localdomain rancher-system-agent[1913]: time="2023-08-15T10:50:25+02:00" level=info msg="Starting remote watch of plans"
Aug 15 10:50:25 localhost.localdomain rancher-system-agent[1913]: E0815 10:50:25.858315    1913 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Aug 15 10:50:25 localhost.localdomain rancher-system-agent[1913]: time="2023-08-15T10:50:25+02:00" level=info msg="Starting /v1, Kind=Secret controller"
```
Also the same on openSUSE Leap, so there must be something wrong with my Rancher installation 🤔
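When the agent starts but nothing visible happens, filtering the journal down to error-level lines helps separate noise from real failures. A small sketch: `filter_errors` is a hypothetical helper, and the embedded sample reuses lines from the journal above; in real use you would pipe `journalctl -u rancher-system-agent --no-pager` into it instead:

```shell
#!/usr/bin/env bash
# Sketch: pull error-level lines out of rancher-system-agent output.
# Matches logrus error/fatal levels and klog "Emmdd hh:mm:ss" error lines.
filter_errors() {
  grep -E 'level=(error|fatal)|(^| )E[0-9]{4} [0-9:.]+'
}

# Sample pasted from the journal above; replace with a real journalctl pipe.
sample='time="2023-08-15T10:50:25+02:00" level=info msg="Starting remote watch of plans"
E0815 10:50:25.858315    1913 memcache.go:206] couldn'\''t get resource list for management.cattle.io/v3:
time="2023-08-15T10:50:25+02:00" level=info msg="Starting /v1, Kind=Secret controller"'

printf '%s\n' "$sample" | filter_errors
```

Here only the memcache.go line survives the filter, which narrows the investigation to the agent's connection back to Rancher rather than the plan execution itself.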
As a summary:
• The issue with nodes not running bootstrap commands was actually by design, as I had failed to register worker nodes to the cluster. I wanted to set up etcd and control plane nodes first; however, Rancher waits until one node of each role has been registered before it starts provisioning. The info message
> waiting for at least one control plane, etcd, and worker node to be registered
already says as much.
• The issue with the cluster waiting for node join URLs was most probably related to my cattle-system/rancher svc not having any active endpoints in my Docker setup. I manually added an endpoint with the IP address of the Docker container and fleet is happy again.
• I tried to bootstrap ARM64 worker nodes with RKE2 1.26.7 (Radxa ROCK 5B SBCs); however, ARM64 is not yet supported. It looks like it will be supported in RKE2 1.27 and hopefully in a future Rancher release (I'm currently running Rancher 2.7.5).
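For reference, the manual endpoint fix from the summary can be written as a manifest. This is a sketch, not an official procedure: a selector-less Service gets no Endpoints object automatically, so one can be supplied by hand. The IP and port below are placeholders for the Docker container running Rancher:

```yaml
# Sketch: manual Endpoints for the selector-less cattle-system/rancher Service.
# Find the container IP with: docker inspect <rancher-container>
apiVersion: v1
kind: Endpoints
metadata:
  name: rancher            # must match the Service name exactly
  namespace: cattle-system
subsets:
  - addresses:
      - ip: 172.17.0.2     # placeholder: the Rancher container's IP
    ports:
      - port: 443          # placeholder: the Rancher container's HTTPS port
        protocol: TCP
```

Note that a manually created Endpoints object is overwritten if the Service ever gains a selector, so this is a workaround rather than a permanent fix.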
g
Any resolution for this? I am stuck on exactly the same issue.