# general
a
i just tried to bootstrap a new RKE2 cluster with Kubernetes 1.31 (3 nodes, each with controlplane, etcd and worker roles), but the provisioning is stuck at
Waiting for cluster agent to connect
Anyone know what the issue might be?
c
Log into the node and check the rke2-server and rancher-system-agent logs
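something like this, assuming both run as systemd units (they do on a standard RKE2 install):
Copy code
journalctl -u rke2-server -f            # RKE2 server (supervisor) log
journalctl -u rancher-system-agent -f   # Rancher provisioning agent log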
a
this is all i see
Copy code
Feb 19 15:29:41 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:29:41-08:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Feb 19 15:29:41 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:29:41-08:00" level=info msg="[K8s] updated plan secret fleet-default/custom-003556cd88c4-machine-plan with feedback"
Feb 19 15:39:42 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:42-08:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20250219-153942/d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0"
Feb 19 15:39:42 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:42-08:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Feb 19 15:39:42 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:42-08:00" level=info msg="[d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0:stdout]: Name                                 Location                                                                              Size     Created"
Feb 19 15:39:42 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:42-08:00" level=info msg="[d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0:stdout]: etcd-snapshot-test-k8s-01-1740006002 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-test-k8s-01-1740006002 13500448 2025-02-19T15:00:02-08:00"
Feb 19 15:39:42 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:42-08:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Feb 19 15:39:43 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:39:43-08:00" level=info msg="[K8s] updated plan secret fleet-default/custom-003556cd88c4-machine-plan with feedback"
Feb 19 15:49:43 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:43-08:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20250219-154943/d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0"
Feb 19 15:49:43 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:43-08:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Feb 19 15:49:44 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:44-08:00" level=info msg="[d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0:stdout]: Name                                 Location                                                                              Size     Created"
Feb 19 15:49:44 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:44-08:00" level=info msg="[d3cc32bded59d3f541ee16725603468bd4a9ddb2a602a77e69f4f24f9ab6e137_0:stdout]: etcd-snapshot-test-k8s-01-1740006002 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-test-k8s-01-1740006002 13500448 2025-02-19T15:00:02-08:00"
Feb 19 15:49:44 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:44-08:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Feb 19 15:49:44 test-k8s-01 rancher-system-agent[51959]: time="2025-02-19T15:49:44-08:00" level=info msg="[K8s] updated plan secret fleet-default/custom-003556cd88c4-machine-plan with feedback"
Copy code
Feb 19 15:00:02 test-k8s-01 rke2[14214]: {"level":"info","ts":"2025-02-19T15:00:02.261986-0800","logger":"rke2-etcd-client.client","caller":"v3@v3.5.16-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Feb 19 15:00:02 test-k8s-01 rke2[14214]: {"level":"info","ts":"2025-02-19T15:00:02.262043-0800","logger":"rke2-etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Feb 19 15:00:02 test-k8s-01 rke2[14214]: {"level":"info","ts":"2025-02-19T15:00:02.385360-0800","logger":"rke2-etcd-client.client","caller":"v3@v3.5.16-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
Feb 19 15:00:02 test-k8s-01 rke2[14214]: {"level":"info","ts":"2025-02-19T15:00:02.431445-0800","logger":"rke2-etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"14 MB","took":"now"}
Feb 19 15:00:02 test-k8s-01 rke2[14214]: {"level":"info","ts":"2025-02-19T15:00:02.431593-0800","logger":"rke2-etcd-client","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-test-k8s-01-1740006002"}
Feb 19 15:00:02 test-k8s-01 rke2[14214]: time="2025-02-19T15:00:02-08:00" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/etcd-snapshot-test-k8s-01-1740006002"
Feb 19 15:00:02 test-k8s-01 rke2[14214]: time="2025-02-19T15:00:02-08:00" level=info msg="Applying snapshot retention=5 to local snapshots with prefix etcd-snapshot in /var/lib/rancher/rke2/server/db/snapshots"
Feb 19 15:00:02 test-k8s-01 rke2[14214]: time="2025-02-19T15:00:02-08:00" level=info msg="Reconciling ETCDSnapshotFile resources"
Feb 19 15:00:02 test-k8s-01 rke2[14214]: time="2025-02-19T15:00:02-08:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
it's not erroring, but it's not moving either
and it's been like this since i created it
h
What does this show (from CLI of the node)?
Copy code
# export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# kubectl get nodes
c
Check that the nodes are ready, and check the cluster agent logs in the cattle-system namespace.
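roughly this (the deployment name is the Rancher default, adjust if yours differs):
Copy code
kubectl -n cattle-system get pods
kubectl -n cattle-system logs deploy/cattle-cluster-agent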
a
Copy code
NAME          STATUS   ROLES                       AGE   VERSION
test-k8s-01   Ready    control-plane,etcd,master   99m   v1.31.5+rke2r1
it only shows 1 of the 3, and the UI still shows waiting for agent
c
ok so look on the other nodes. can they reach this one?
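a quick reachability check from 02/03 towards 01 would be something like this (10.1.130.218 is 01's IP, 9345 is the RKE2 supervisor/registration port, 6443 the apiserver):
Copy code
nc -zv 10.1.130.218 9345   # RKE2 supervisor / node registration
nc -zv 10.1.130.218 6443   # Kubernetes API server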
a
they're all on the same network and can ping each other. DNS with reverse lookup is set up
c
ok so what do their logs say
a
i'm checking the agent log on 01
c
no, look at the other 2 servers first. what are they waiting for
why are they not in the cluster
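e.g. on 02 and 03, something like:
Copy code
systemctl list-units 'rke2-*' --all                    # is rke2-server/rke2-agent even installed?
journalctl -u rancher-system-agent -n 100 --no-pager   # did the agent receive a plan at all?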
a
hmm where do i look for them?
they don't even have the rke2-server service on them, not sure why
also, this is what 01 shows for the agent pod
Copy code
[root@test-k8s-01 bin]# kubectl get all -n cattle-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/cattle-cluster-agent-5c67d4fdbb-9tskf   0/1     Pending   0          100m
the pod is stuck in pending state
no logs
this is all it shows for the system-agent on 02
Copy code
Feb 19 14:59:45 test-k8s-02 systemd[1]: Stopping Rancher System Agent...
Feb 19 14:59:45 test-k8s-02 systemd[1]: rancher-system-agent.service: Deactivated successfully.
Feb 19 14:59:45 test-k8s-02 systemd[1]: Stopped Rancher System Agent.
Feb 19 14:59:45 test-k8s-02 systemd[1]: Started Rancher System Agent.
Feb 19 14:59:45 test-k8s-02 rancher-system-agent[9802]: time="2025-02-19T14:59:45-08:00" level=info msg="Rancher System Agent version v0.3.11 (b8c28d0) is starting"
Feb 19 14:59:45 test-k8s-02 rancher-system-agent[9802]: time="2025-02-19T14:59:45-08:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Feb 19 14:59:45 test-k8s-02 rancher-system-agent[9802]: time="2025-02-19T14:59:45-08:00" level=info msg="Starting remote watch of plans"
Feb 19 14:59:45 test-k8s-02 rancher-system-agent[9802]: time="2025-02-19T14:59:45-08:00" level=info msg="Starting /v1, Kind=Secret controller"
h
do you have firewall running on these nodes?
if yes you need to allow these ports https://docs.rke2.io/install/requirements#networking
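if it were on, the core rules would look roughly like this (not exhaustive, and the CNI ports depend on which CNI you run; the linked page has the full list):
Copy code
firewall-cmd --permanent --add-port=9345/tcp        # RKE2 supervisor / node registration
firewall-cmd --permanent --add-port=6443/tcp        # Kubernetes API
firewall-cmd --permanent --add-port=10250/tcp       # kubelet
firewall-cmd --permanent --add-port=2379-2381/tcp   # etcd client/peer/metrics
firewall-cmd --permanent --add-port=4789/udp        # Calico VXLAN
firewall-cmd --reload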
a
firewalld is turned off
Copy code
[root@test-k8s-02 bin]# systemctl status firewalld
○ firewalld.service - firewalld - dynamic firewall daemon
     Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; preset: enabled)
     Active: inactive (dead)
       Docs: man:firewalld(1)
c
what do you get from
kubectl get pod -A -o wide
on the first node
something on the first node is not ready yet
Do you have enough cpu/memory/disk for all the core pods to be scheduled?
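a quick way to check, for example:
Copy code
kubectl describe node test-k8s-01 | grep -A 8 'Allocated resources'   # requests vs allocatable
df -h /var/lib/rancher                                                # disk used by etcd/containerd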
a
i'm only doing a test so it's a pretty small build, but each node has 4 CPUs and 8 GB of memory
Copy code
[root@test-k8s-01 bin]# kubectl get pod -A -o wide
NAMESPACE         NAME                                                    READY   STATUS      RESTARTS   AGE    IP             NODE          NOMINATED NODE   READINESS GATES
calico-system     calico-kube-controllers-54ddfbf69b-b7xh4                0/1     Pending     0          121m   <none>         <none>        <none>           <none>
calico-system     calico-node-24w8f                                       0/1     Running     0          121m   10.1.130.218   test-k8s-01   <none>           <none>
calico-system     calico-typha-6dfbbcb6c6-pqvbr                           0/1     Pending     0          121m   <none>         <none>        <none>           <none>
cattle-system     cattle-cluster-agent-5c67d4fdbb-9tskf                   0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       etcd-test-k8s-01                                        1/1     Running     0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rancher-vsphere-cpi-q5bt8                  0/1     Completed   0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rancher-vsphere-csi-cs5nk                  0/1     Completed   0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rke2-calico-4lw8n                          0/1     Completed   2          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rke2-calico-crd-nhdp5                      0/1     Completed   0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rke2-coredns-flktc                         0/1     Completed   0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       helm-install-rke2-ingress-nginx-gvq9s                   0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       helm-install-rke2-metrics-server-2w6p8                  0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       helm-install-rke2-runtimeclasses-827p9                  0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       helm-install-rke2-snapshot-controller-crd-26jc5         0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       helm-install-rke2-snapshot-controller-tchpk             0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       kube-apiserver-test-k8s-01                              1/1     Running     0          121m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       kube-controller-manager-test-k8s-01                     1/1     Running     0          121m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       kube-proxy-test-k8s-01                                  1/1     Running     0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       kube-scheduler-test-k8s-01                              1/1     Running     0          121m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       rancher-vsphere-cpi-cloud-controller-manager-gql4g      1/1     Running     0          122m   10.1.130.218   test-k8s-01   <none>           <none>
kube-system       rke2-coredns-rke2-coredns-55bdf87668-hd8n2              0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       rke2-coredns-rke2-coredns-autoscaler-65c8c6bd64-77vpn   0/1     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       vsphere-csi-controller-7f677c5776-66fds                 0/7     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       vsphere-csi-controller-7f677c5776-jb66x                 0/7     Pending     0          122m   <none>         <none>        <none>           <none>
kube-system       vsphere-csi-controller-7f677c5776-qk4jp                 0/7     Pending     0          122m   <none>         <none>        <none>           <none>
tigera-operator   tigera-operator-8445fdf4df-j5q56                        1/1     Running     0          122m   10.1.130.218   test-k8s-01   <none>           <none>
c
bunch of stuff is pending, describe those pods to see why
in particular check on the calico and coredns pods
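for example, using the pod names from your listing:
Copy code
kubectl -n calico-system describe pod calico-kube-controllers-54ddfbf69b-b7xh4
kubectl -n kube-system describe pod rke2-coredns-rke2-coredns-55bdf87668-hd8n2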
a
sec
Copy code
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  3m59s (x480 over 138m)  kubelet  Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
c
check the pending ones. not the running but not ready ones.
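you can list just the Pending ones with:
Copy code
kubectl get pods -A --field-selector=status.phase=Pending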
a
Copy code
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  24m (x23 over 134m)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
hm maybe this is why
i did enable vsphere CPI to test integration with vsphere
c
that indicates that your vsphere cpi isn’t configured properly
a
is there any documentation on how to configure it correctly?
c
check the rancher-vsphere-cpi-cloud-controller-manager-gql4g logs
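i.e. something like:
Copy code
kubectl -n kube-system logs rancher-vsphere-cpi-cloud-controller-manager-gql4g --tail=50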
a
Copy code
I0220 00:52:52.219792       1 node_controller.go:233] error syncing 'test-k8s-01': failed to get instance metadata for node test-k8s-01: failed to get instance ID from cloud provider: unable to find suitable IP address for node test-k8s-01 with IP family [10.1.130.218 10.1.130.219 10.1.130.220], requeuing
E0220 00:52:52.219980       1 node_controller.go:244] "Unhandled Error" err="error syncing 'test-k8s-01': failed to get instance metadata for node test-k8s-01: failed to get instance ID from cloud provider: unable to find suitable IP address for node test-k8s-01 with IP family [10.1.130.218 10.1.130.219 10.1.130.220], requeuing"
I0220 00:54:00.918563       1 node_controller.go:429] Initializing node test-k8s-01 with cloud provider
I0220 00:54:00.918677       1 search.go:76] WhichVCandDCByNodeID nodeID: test-k8s-01
I0220 00:54:00.931101       1 search.go:208] Found node test-k8s-01 as vm=VirtualMachine:vm-211194 in vc=vcenter.torrance.vcf.docmagic.com and datacenter=Torrance
I0220 00:54:00.931126       1 search.go:210] Hostname: test-k8s-01, UUID: 420e2872-4d24-8ad1-22d1-e05b1d10fed8
I0220 00:54:00.931172       1 nodemanager.go:146] Discovered VM using FQDN or short-hand name
I0220 00:54:00.935955       1 nodemanager.go:276] Adding Hostname: test-k8s-01
I0220 00:54:00.935996       1 node_controller.go:233] error syncing 'test-k8s-01': failed to get instance metadata for node test-k8s-01: failed to get instance ID from cloud provider: unable to find suitable IP address for node test-k8s-01 with IP family [10.1.130.218 10.1.130.219 10.1.130.220], requeuing
E0220 00:54:00.936052       1 node_controller.go:244] "Unhandled Error" err="error syncing 'test-k8s-01': failed to get instance metadata for node test-k8s-01: failed to get instance ID from cloud provider: unable to find suitable IP address for node test-k8s-01 with IP family [10.1.130.218 10.1.130.219 10.1.130.220], requeuing"
I0220 00:54:19.985654       1 node_controller.go:271] Update 1 nodes status took 57.4µs.
c
fill out the vsphere stuff properly in the ui…
a
i wasn't very clear on what the address values are supposed to be
c
ok so it is talking to vsphere and found the VMs but can't figure out what to do with the IPs
a
(screenshot: the vSphere CPI address/IP-family field in the Rancher UI, filled in with the node IP addresses)
i guess this is the problem
can i leave that blank?
c
ok it tells you what valid values are. it wants "ipv4" or "ipv6" or both, but you put in… ip addresses for some reason?
a
oh i see. it wanted literally "ipv4"
c
yes
or just leave it blank and it will try to figure it out on its own
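once the CPI can initialize the nodes, the uninitialized taint should clear and the Pending pods should schedule; quick way to confirm:
Copy code
kubectl describe node test-k8s-01 | grep -i taints   # node.cloudprovider.kubernetes.io/uninitialized should be gone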
a
gotcha
sorry, this is my first time trying the CPI, i had zero clue what to do. we were using the in-tree provider for a long time but now we have to switch to this to go to k8s 1.31