# general
l
OK, the first problem (the node not being provisioned at all) occurred because containerd was missing on my nodes. I was under the impression that the rancher-system-agent would install all requirements automatically (as kubespray does).
However, the second problem remains: after creating a new RKE2 cluster in my Docker installation of Rancher, provisioning is stuck at
```
2023/08/13 11:38:21 [INFO] [planner] rkecluster fleet-default/innos: waiting: non-ready bootstrap machine(s) custom-1d2b43432a95 and join url to be available on bootstrap node
```
The cattle-cluster-agent output only shows
```
time="2023-08-13T11:28:30Z" level=info msg="Starting /v1, Kind=Node controller"
time="2023-08-13T11:28:30Z" level=info msg="Starting catalog.cattle.io/v1, Kind=App controller"
time="2023-08-13T11:28:30Z" level=info msg="Starting catalog.cattle.io/v1, Kind=Operation controller"
time="2023-08-13T11:28:30Z" level=info msg="Starting apps/v1, Kind=Deployment controller"
time="2023-08-13T11:28:30Z" level=info msg="Starting /v1, Kind=ReplicationController controller"
```
and all pods are healthy:
```
NAMESPACE       NAME                                                    READY   STATUS      RESTARTS      AGE
cattle-system   cattle-cluster-agent-6879944f84-tnckz                   1/1     Running     0             17m
cattle-system   helm-operation-ft4f9                                    0/2     Completed   0             17m
cattle-system   rancher-webhook-74c9bd4d6-td9vr                         1/1     Running     0             16m
kube-system     cilium-m45hm                                            1/1     Running     0             18m
kube-system     cilium-operator-fdb5c85f8-mvh2l                         1/1     Running     0             18m
kube-system     cilium-operator-fdb5c85f8-rxw5c                         0/1     Pending     0             18m
kube-system     cloud-controller-manager-rancher-node-1                 1/1     Running     1 (18m ago)   18m
kube-system     etcd-rancher-node-1                                     1/1     Running     0             17m
kube-system     helm-install-rke2-cilium-qbpr6                          0/1     Completed   0             18m
kube-system     helm-install-rke2-coredns-v9lh5                         0/1     Completed   0             18m
kube-system     helm-install-rke2-metrics-server-km499                  0/1     Completed   0             18m
kube-system     helm-install-rke2-snapshot-controller-crd-vz7h6         0/1     Completed   0             18m
kube-system     helm-install-rke2-snapshot-controller-l6mf6             0/1     Completed   1             18m
kube-system     helm-install-rke2-snapshot-validation-webhook-tg7b9     0/1     Completed   0             18m
kube-system     kube-apiserver-rancher-node-1                           1/1     Running     0             18m
kube-system     kube-controller-manager-rancher-node-1                  1/1     Running     0             18m
kube-system     kube-proxy-rancher-node-1                               1/1     Running     0             18m
kube-system     kube-scheduler-rancher-node-1                           1/1     Running     0             18m
kube-system     rke2-coredns-rke2-coredns-7c98b7488c-pxkht              1/1     Running     0             18m
kube-system     rke2-coredns-rke2-coredns-autoscaler-65b5bfc754-8q5w6   1/1     Running     0             18m
kube-system     rke2-metrics-server-5bf59cdccb-pgg9p                    1/1     Running     0             17m
kube-system     rke2-snapshot-controller-6f7bbb497d-j2wlf               1/1     Running     0             17m
kube-system     rke2-snapshot-validation-webhook-5c499b5cdd-fnr87       1/1     Running     0             17m
```
Now node provisioning has stopped working again, even after installing containerd manually. Something is not right in my setup 🤔
m
> non-ready bootstrap machine(s) custom-1d2b43432a95 and join url to be available on bootstrap node
Are you doing this as an HA setup or a single node? And I assume these are running on different physical or virtual machines, right? The bootstrap node is the first node registered for the cluster, so if that node can't call into Rancher due to a firewall or network routing, this can happen. It can also happen if your nodes don't have fixed IPs: if the initial bootstrap node got registered but then relaunched with a different IP (you may want to set a DHCP reservation for the RKE2 nodes or use static IPs), it will try to join a cluster that might no longer exist instead of bootstrapping it.
In the latter case, the easiest way would probably be to delete the rke2 cluster and re-provision it. Alternatively, you could try to bootstrap it manually and then have Rancher adopt it once it's up.
You do not need (and probably should not) have containerd installed on the nodes first. The RKE2 installer sets up and runs its own containerd for you.
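Not from the thread, but a quick way to rule out the firewall/routing cause described above is to probe the relevant ports from the joining node. A minimal sketch, assuming the Rancher address from this thread (192.168.178.34:8443) and the usual RKE2 ports on the bootstrap node (9345 supervisor, 6443 kube-apiserver); `check_port` is a hypothetical helper, and the example hosts are placeholders:

```shell
#!/usr/bin/env bash
# Sketch: verify a prospective RKE2 node can reach Rancher and the bootstrap
# node before registering. Hosts and ports are assumptions -- adjust to your setup.

check_port() {                       # prints "open" or "closed" for host:port
  local host=$1 port=$2
  if timeout 2 bash -c ": </dev/tcp/${host}/${port}" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

# Example invocations (substitute your addresses):
# check_port 192.168.178.34 8443          # Rancher UI/API
# check_port <bootstrap-node-ip> 9345     # RKE2 supervisor (join) port
# check_port <bootstrap-node-ip> 6443     # kube-apiserver
# curl -sk https://192.168.178.34:8443/ping   # Rancher health endpoint
```

If any of these report closed from a node that should join, that points at the network rather than at the provisioning logic.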
l
> delete the rke2 cluster and re-provision
Thanks, I have done that multiple times now and the results are mixed. Registering the cluster worked once or twice, but it was never repeatable.
> The rke2 installer sets up and runs containerd for you.
OK, thanks, good to know that this will not fix my problem and I have to look elsewhere.
> if that node can't call into Rancher due to a firewall or network routing, this can happen. It can also happen if you don't have set IPs
Thanks for the advice. At the moment I do not have any firewalls in my local network in place (ufw is disabled on the Ubuntu instances) and all nodes use fixed IP addresses.
The cluster setup is very basic; I only opted for Cilium instead of Calico. However, currently nothing happens at all on the nodes after launching the bootstrap command: the rancher-system-agent is started, but that is it.
(Cluster YAML attached)
The situation is the same when running the bootstrap on an openSUSE Tumbleweed installation instead of Ubuntu 22.04.
```
curl --insecure -fL https://192.168.178.34:8443/system-agent-install.sh | sudo sh -s - --server https://192.168.178.34:8443 --label 'cattle.io/os=linux' --token xxx --ca-checksum a0b9c3b2b771127b4b8f3bcf0fb81d1fd535a22f63aea52023d7a3c3f2444c70 --etcd --controlplane
```
The agent gets installed and started, but nothing happens. The last agent logs are:
```
sudo journalctl -f -u rancher-system-agent
Aug 15 10:50:25 localhost.localdomain systemd[1]: Started Rancher System Agent.
Aug 15 10:50:25 localhost.localdomain rancher-system-agent[1913]: time="2023-08-15T10:50:25+02:00" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Aug 15 10:50:25 localhost.localdomain rancher-system-agent[1913]: time="2023-08-15T10:50:25+02:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Aug 15 10:50:25 localhost.localdomain rancher-system-agent[1913]: time="2023-08-15T10:50:25+02:00" level=info msg="Starting remote watch of plans"
Aug 15 10:50:25 localhost.localdomain rancher-system-agent[1913]: E0815 10:50:25.858315    1913 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Aug 15 10:50:25 localhost.localdomain rancher-system-agent[1913]: time="2023-08-15T10:50:25+02:00" level=info msg="Starting /v1, Kind=Secret controller"
```
Also the same on openSUSE Leap, so there must be something wrong with my Rancher installation 🤔
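When the agent starts but nothing visible happens, filtering the journal down to error-level lines helps separate noise from real failures. A small sketch: `filter_errors` is a hypothetical helper, and the embedded sample reuses lines from the journal above; in real use you would pipe `journalctl -u rancher-system-agent --no-pager` into it instead:

```shell
#!/usr/bin/env bash
# Sketch: pull error-level lines out of rancher-system-agent output.
# Matches logrus error/fatal levels and klog "Emmdd hh:mm:ss" error lines.
filter_errors() {
  grep -E 'level=(error|fatal)|(^| )E[0-9]{4} [0-9:.]+'
}

# Sample pasted from the journal above; replace with a real journalctl pipe.
sample='time="2023-08-15T10:50:25+02:00" level=info msg="Starting remote watch of plans"
E0815 10:50:25.858315    1913 memcache.go:206] couldn'\''t get resource list for management.cattle.io/v3:
time="2023-08-15T10:50:25+02:00" level=info msg="Starting /v1, Kind=Secret controller"'

printf '%s\n' "$sample" | filter_errors
```

Here only the memcache.go line survives the filter, which narrows the investigation to the agent's connection back to Rancher rather than the plan execution itself.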
As a summary:
• The issue with nodes not running bootstrap commands was actually by design, as I had failed to register worker nodes to the cluster. I wanted to set up etcd and control plane nodes first; however, Rancher waits until one node of each role has been registered before it starts provisioning. The info message
> waiting for at least one control plane, etcd, and worker node to be registered
already says as much.
• The issue with the cluster waiting for node join URLs was most probably related to my cattle-system/rancher svc not having any active endpoints in my Docker setup. I manually added an endpoint with the IP address of the Docker container and fleet is happy again.
• I tried to bootstrap ARM64 worker nodes with RKE2 1.26.7 (Radxa ROCK 5B SBCs); however, ARM64 is not yet supported. It looks like it will be supported in RKE2 1.27 and hopefully in a future Rancher release (I'm currently running Rancher 2.7.5).
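For reference, the manual endpoint fix from the summary can be written as a manifest. This is a sketch, not an official procedure: a selector-less Service gets no Endpoints object automatically, so one can be supplied by hand. The IP and port below are placeholders for the Docker container running Rancher:

```yaml
# Sketch: manual Endpoints for the selector-less cattle-system/rancher Service.
# Find the container IP with: docker inspect <rancher-container>
apiVersion: v1
kind: Endpoints
metadata:
  name: rancher            # must match the Service name exactly
  namespace: cattle-system
subsets:
  - addresses:
      - ip: 172.17.0.2     # placeholder: the Rancher container's IP
    ports:
      - port: 443          # placeholder: the Rancher container's HTTPS port
        protocol: TCP
```

Note that a manually created Endpoints object is overwritten if the Service ever gains a selector, so this is a workaround rather than a permanent fix.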
g
Any resolution for this? I am stuck on exactly the same issue.