# general
a
i just tried to bootstrap a new RKE2 cluster with Kubernetes 1.31 (3 nodes, each with controlplane, etcd and worker roles), but the provisioning is stuck at
Waiting for cluster agent to connect
Anyone know what the issue might be?
c
Log into the node and check the rke2-server and rancher-system-agent logs
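something like this, assuming both run as systemd units (they do on a standard RKE2 install):
Copy code
journalctl -u rke2-server -f            # RKE2 server (supervisor) log
journalctl -u rancher-system-agent -f   # Rancher provisioning agent log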
a
this is all i see
Copy code
Feb 19 15:29:41 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:29:41-08:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Feb 19 15:29:41 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:29:41-08:00" level=info msg="[K8s] updated plan secret fleet-default/custom-003556cd88c4-machine-plan with feedback"
Feb 19 15:39:42 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:42-08:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20250219-153942/d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0"
Feb 19 15:39:42 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:42-08:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Feb 19 15:39:42 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:42-08:00" level=info msg="[d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0:stdout]: Name                                 Location                                                                              Size     Created"
Feb 19 15:39:42 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:42-08:00" level=info msg="[d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0:stdout]: etcd-snapshot-test-k8s-01-1740006002 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-test-k8s-01-1740006002 13500448 2025-02-19T15:00:02-08:00"
Feb 19 15:39:42 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:42-08:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Feb 19 15:39:43 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:43-08:00" level=info msg="[K8s] updated plan secret fleet-default/custom-003556cd88c4-machine-plan with feedback"
Feb 19 15:49:43 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:43-08:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20250219-154943/d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0"
Feb 19 15:49:43 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:43-08:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Feb 19 15:49:44 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:44-08:00" level=info msg="[d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0:stdout]: Name                                 Location                                                                              Size     Created"
Feb 19 15:49:44 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:44-08:00" level=info msg="[d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0:stdout]: etcd-snapshot-test-k8s-01-1740006002 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-test-k8s-01-1740006002 13500448 2025-02-19T15:00:02-08:00"
Feb 19 15:49:44 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:44-08:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Feb 19 15:49:44 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:44-08:00" level=info msg="[K8s] updated plan secret fleet-default/custom-003556cd88c4-machine-plan with feedback"
Copy code
Feb 19 15:00:02 test-k8s-01 rke2[14214]: {"level":"info","ts":"2025-02-19T15:00:02.261986-0800","logger":"rke2-etcd-client.client","caller":"v3@v3.5.16-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Feb 19 15:00:02 test-k8s-01 rke2[14214]: {"level":"info","ts":"2025-02-19T15:00:02.262043-0800","logger":"rke2-etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Feb 19 15:00:02 test-k8s-01 rke2[14214]: {"level":"info","ts":"2025-02-19T15:00:02.385360-0800","logger":"rke2-etcd-client.client","caller":"v3@v3.5.16-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
Feb 19 15:00:02 test-k8s-01 rke2[14214]: {"level":"info","ts":"2025-02-19T15:00:02.431445-0800","logger":"rke2-etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"14 MB","took":"now"}
Feb 19 15:00:02 test-k8s-01 rke2[14214]: {"level":"info","ts":"2025-02-19T15:00:02.431593-0800","logger":"rke2-etcd-client","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-test-k8s-01-1740006002"}
Feb 19 15:00:02 test-k8s-01 rke2[14214]: time="2025-02-19T15:00:02-08:00" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/etcd-snapshot-test-k8s-01-1740006002"
Feb 19 15:00:02 test-k8s-01 rke2[14214]: time="2025-02-19T15:00:02-08:00" level=info msg="Applying snapshot retention=5 to local snapshots with prefix etcd-snapshot in /var/lib/rancher/rke2/server/db/snapshots"
Feb 19 15:00:02 test-k8s-01 rke2[14214]: time="2025-02-19T15:00:02-08:00" level=info msg="Reconciling ETCDSnapshotFile resources"
Feb 19 15:00:02 test-k8s-01 rke2[14214]: time="2025-02-19T15:00:02-08:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
it's not erroring, but it's not moving either
and it's been like this since i created it
h
What does this show (from CLI of the node)?
Copy code
# export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# kubectl get nodes
c
Check that the nodes are ready, and check the cluster agent logs in the cattle-system namespace.
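roughly this (the deployment name is the Rancher default, adjust if yours differs):
Copy code
kubectl -n cattle-system get pods
kubectl -n cattle-system logs deploy/cattle-cluster-agent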
a
Copy code
NAME          STATUS   ROLES                       AGE   VERSION
test-k8s-01   Ready    control-plane,etcd,master   99m   v1.31.5+rke2r1
it only shows 1 of the 3, and the UI still shows waiting for agent
c
ok so look on the other nodes. can they reach this one?
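a quick reachability check from 02/03 towards 01 would be something like this (10.1.130.218 is 01's IP, 9345 is the RKE2 supervisor/registration port, 6443 the apiserver):
Copy code
nc -zv 10.1.130.218 9345   # RKE2 supervisor / node registration
nc -zv 10.1.130.218 6443   # Kubernetes API server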
a
they're all on the same network and can ping each other. DNS with reverse lookup is set up
c
ok so what do their logs say
a
i'm checking the agent log on 01
c
no, look at the other 2 servers first. what are they waiting for
why are they not in the cluster
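e.g. on 02 and 03, something like:
Copy code
systemctl list-units 'rke2-*' --all                    # is rke2-server/rke2-agent even installed?
journalctl -u rancher-system-agent -n 100 --no-pager   # did the agent receive a plan at all?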
a
hmm where do i look for them?
they don't even have the rke2-server service on them, not sure why
also, this is what 01 shows for the agent pod
Copy code
[root@test-k8s-01 bin]# kubectl get all -n cattle-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/cattle-cluster-agent-5c67d4fdbb-9tskf   0/1     Pending   0          100m
the pod is stuck in pending state
no logs
this is all it shows for the system-agent on 02
Copy code
Feb 19 14:59:45 test-k8s-02 systemd[1]: Stopping Rancher System Agent...
Feb 19 14:59:45 test-k8s-02 systemd[1]: rancher-system-agent.service: Deactivated successfully.
Feb 19 14:59:45 test-k8s-02 systemd[1]: Stopped Rancher System Agent.
Feb 19 14:59:45 test-k8s-02 systemd[1]: Started Rancher System Agent.
Feb 19 14:59:45 test-k8s-02 rancher-system-agent[9802]: time="2025-02-19T14:59:45-08:00" level=info msg="Rancher System Agent version v0.3.11 (b8c28d0) is starting"
Feb 19 14:59:45 test-k8s-02 rancher-system-agent[9802]: time="2025-02-19T14:59:45-08:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Feb 19 14:59:45 test-k8s-02 rancher-system-agent[9802]: time="2025-02-19T14:59:45-08:00" level=info msg="Starting remote watch of plans"
Feb 19 14:59:45 test-k8s-02 rancher-system-agent[9802]: time="2025-02-19T14:59:45-08:00" level=info msg="Starting /v1, Kind=Secret controller"
h
do you have firewall running on these nodes?
if yes you need to allow these ports https://docs.rke2.io/install/requirements#networking
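if it were on, the core rules would look roughly like this (not exhaustive, and the CNI ports depend on which CNI you run; the linked page has the full list):
Copy code
firewall-cmd --permanent --add-port=9345/tcp        # RKE2 supervisor / node registration
firewall-cmd --permanent --add-port=6443/tcp        # Kubernetes API
firewall-cmd --permanent --add-port=10250/tcp       # kubelet
firewall-cmd --permanent --add-port=2379-2381/tcp   # etcd client/peer/metrics
firewall-cmd --permanent --add-port=4789/udp        # Calico VXLAN
firewall-cmd --reload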
a
firewalld is turned off
Copy code
[root@test-k8s-02 bin]# systemctl status firewalld
○ firewalld.service - firewalld - dynamic firewall daemon
     Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; preset: enabled)
     Active: inactive (dead)
       Docs: man:firewalld(1)
c
what do you get from
kubectl get pod -A -o wide
on the first node
something on the first node is not ready yet
Do you have enough cpu/memory/disk for all the core pods to be scheduled?
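a quick way to check, for example:
Copy code
kubectl describe node test-k8s-01 | grep -A 8 'Allocated resources'   # requests vs allocatable
df -h /var/lib/rancher                                                # disk used by etcd/containerd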
a
i'm only doing a test so it's a pretty small build, but each node has 4 CPUs and 8 GB of memory
Copy code
[root@test-k8s-01 bin]# kubectl get pod -A -o wide
NAMESPACE         NAME                                                    READY   STATUS      RESTARTS   AGE    IP             NODE          NOMINATED NODE   READINESS GATES
calico-system     calico-kube-controllers-54ddfbf69b-b7xh4                0/1     Pending     0          121m   <none>         <none>        <none>           <none>
calico-system     calico-node-24w8f                                       0/1     Running     0          121m   10.1.130.218   test-k8s-01   <none>           <none>
calico-system     calico-typha-6dfbbcb6c6-pqvbr                           0/1     Pending     0          121m   <none>         <none>        <none>           <none>
cattle-system     cattle-cluster-agent-5c67d4fdbb-9tskf                   0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       etcd-test-k8s-01                                        1/1     Running     0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rancher-vsphere-cpi-q5bt8                  0/1     Completed   0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rancher-vsphere-csi-cs5nk                  0/1     Completed   0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rke2-calico-4lw8n                          0/1     Completed   2          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rke2-calico-crd-nhdp5                      0/1     Completed   0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rke2-coredns-flktc                         0/1     Completed   0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rke2-ingress-nginx-gvq9s                   0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       helm-install-rke2-metrics-server-2w6p8                  0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       helm-install-rke2-runtimeclasses-827p9                  0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       helm-install-rke2-snapshot-controller-crd-26jc5         0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       helm-install-rke2-snapshot-controller-tchpk             0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       kube-apiserver-test-k8s-01                              1/1     Running     0          121m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       kube-controller-manager-test-k8s-01                     1/1     Running     0          121m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       kube-proxy-test-k8s-01                                  1/1     Running     0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       kube-scheduler-test-k8s-01                              1/1     Running     0          121m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       rancher-vsphere-cpi-cloud-controller-manager-gql4g      1/1     Running     0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       rke2-coredns-rke2-coredns-55bdf87668-hd8n2              0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       rke2-coredns-rke2-coredns-autoscaler-65c8c6bd64-77vpn   0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       vsphere-csi-controller-7f677c5776-66fds                 0/7     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       vsphere-csi-controller-7f677c5776-jb66x                 0/7     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       vsphere-csi-controller-7f677c5776-qk4jp                 0/7     Pending     0          122m   <none>         <none>        <none>           <none>
tigera-operator   tigera-operator-8445fdf4df-j5q56                        1/1     Running     0          122m   10.1.130.218   test-k8s-01   <none>           <none>
c
bunch of stuff is pending, describe those pods to see why
in particular check on the calico and coredns pods
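for example, using the pod names from your listing:
Copy code
kubectl -n calico-system describe pod calico-kube-controllers-54ddfbf69b-b7xh4
kubectl -n kube-system describe pod rke2-coredns-rke2-coredns-55bdf87668-hd8n2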
a
sec
Copy code
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  3m59s (x480 over 138m)  kubelet  Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
c
check the pending ones. not the running but not ready ones.
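you can list just the Pending ones with:
Copy code
kubectl get pods -A --field-selector=status.phase=Pending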
a
Copy code
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  24m (x23 over 134m)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
hm maybe this is why
i did enable vsphere CPI to test integration with vsphere
c
that indicates that your vsphere cpi isn’t configured properly
a
is there any documentation on how to configure it correctly?
c
check the rancher-vsphere-cpi-cloud-controller-manager-gql4g logs
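i.e. something like:
Copy code
kubectl -n kube-system logs rancher-vsphere-cpi-cloud-controller-manager-gql4g --tail=50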
a
Copy code
I0220 00:52:52.219792       1 node_controller.go:233] error syncing 'test-k8s-01': failed to get instance metadata for node test-k8s-01: failed to get instance ID from cloud provider: unable to find suitable IP address for node test-k8s-01 with IP family [10.1.130.218 10.1.130.219 10.1.130.220], requeuing
E0220 00:52:52.219980       1 node_controller.go:244] "Unhandled Error" err="error syncing 'test-k8s-01': failed to get instance metadata for node test-k8s-01: failed to get instance ID from cloud provider: unable to find suitable IP address for node test-k8s-01 with IP family [10.1.130.218 10.1.130.219 10.1.130.220], requeuing"
I0220 00:54:00.918563       1 node_controller.go:429] Initializing node test-k8s-01 with cloud provider
I0220 00:54:00.918677       1 search.go:76] WhichVCandDCByNodeID nodeID: test-k8s-01
I0220 00:54:00.931101       1 search.go:208] Found node test-k8s-01 as vm=VirtualMachine:vm-211194 in vc=vcenter.torrance.vcf.docmagic.com and datacenter=Torrance
I0220 00:54:00.931126       1 search.go:210] Hostname: test-k8s-01, UUID: 420e2872-4d24-8ad1-22d1-e05b1d10fed8
I0220 00:54:00.931172       1 nodemanager.go:146] Discovered VM using FQDN or short-hand name
I0220 00:54:00.935955       1 nodemanager.go:276] Adding Hostname: test-k8s-01
I0220 00:54:00.935996       1 node_controller.go:233] error syncing 'test-k8s-01': failed to get instance metadata for node test-k8s-01: failed to get instance ID from cloud provider: unable to find suitable IP address for node test-k8s-01 with IP family [10.1.130.218 10.1.130.219 10.1.130.220], requeuing
E0220 00:54:00.936052       1 node_controller.go:244] "Unhandled Error" err="error syncing 'test-k8s-01': failed to get instance metadata for node test-k8s-01: failed to get instance ID from cloud provider: unable to find suitable IP address for node test-k8s-01 with IP family [10.1.130.218 10.1.130.219 10.1.130.220], requeuing"
I0220 00:54:19.985654       1 node_controller.go:271] Update 1 nodes status took 57.4µs.
c
fill out the vsphere stuff properly in the ui…
a
i wasn't very clear on what the address values are supposed to be
c
ok so it is talking to vsphere and found the VMs but can't figure out what to do with the IPs
a
(screenshot: the vSphere CPI address/IP-family field in the Rancher UI, filled in with the node IP addresses)
i guess this is the problem
can i leave that blank?
c
ok it tells you what valid values are. it wants "ipv4" or "ipv6" or both, but you put in… ip addresses for some reason?
a
oh i see. it wanted literally "ipv4"
c
yes
or just leave it blank and it will try to figure it out on its own
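once the CPI can initialize the nodes, the uninitialized taint should clear and the Pending pods should schedule; quick way to confirm:
Copy code
kubectl describe node test-k8s-01 | grep -i taints   # node.cloudprovider.kubernetes.io/uninitialized should be gone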
a
gotcha
sorry, this is my first time trying the CPI, i had zero clue what to do. we were using the in-tree provider for a long time but now we have to switch to this to go to k8s 1.31