# general
c
1. Your message is unclear. These logs appear to be from a cluster node that Rancher has provisioned, not the cluster that Rancher is running on? What exactly is failing here?
2. Rancher 2.9.0 is from July. Is there any reason you're not using 2.9.3?
p
okay, the logs belong to the Rancher UI (the Rancher server itself), not the cluster being provisioned (incomplete) by Rancher. I just logged in again and here are the latest logs for the Rancher UI:
```
2024/11/06 19:29:01 [ERROR] error syncing '_all_': handler user-controllers-controller: userControllersController: failed to set peers for key _all_: failed to start user controllers for cluster c-m-ssbpz56l: ClusterUnavailable 503: cluster not found, requeuing
2024/11/06 19:30:34 [ERROR] error syncing 'c-m-ssbpz56l': handler cluster-deploy: cluster context c-m-ssbpz56l is unavailable, requeuing
2024/11/06 19:30:34 [INFO] [planner] rkecluster fleet-default/demo1: configuring bootstrap node(s) demo1-cp-nt9h9-qxx28: waiting for cluster agent to connect
2024/11/06 19:31:01 [ERROR] error syncing '_all_': handler user-controllers-controller: userControllersController: failed to set peers for key _all_: failed to start user controllers for cluster c-m-ssbpz56l: ClusterUnavailable 503: cluster not found, requeuing
2024/11/06 19:32:34 [ERROR] error syncing 'c-m-ssbpz56l': handler cluster-deploy: cluster context c-m-ssbpz56l is unavailable, requeuing
2024/11/06 19:32:34 [INFO] [planner] rkecluster fleet-default/demo1: configuring bootstrap node(s) demo1-cp-nt9h9-qxx28: waiting for cluster agent to connect
[… the same ERROR "error syncing" / INFO "waiting for cluster agent to connect" pair repeats every two minutes through 19:56 …]
2024/11/06 19:50:49 [INFO] Purged 1 expired tokens
```
and here is one of the 3 CP nodes:
```
root@demo1-cp-nt9h9-qxx28:~# k get node
NAME                   STATUS   ROLES                       AGE   VERSION
demo1-cp-nt9h9-qxx28   Ready    control-plane,etcd,master   16h   v1.30.5+rke2r1
```
```
root@demo1-cp-nt9h9-qxx28:~# journalctl -u rke2-server.service -f
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.231606+0300","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-demo1-cp-nt9h9-qxx28-1730912404.part"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.235744+0300","logger":"etcd-client.client","caller":"v3@v3.5.13-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.235802+0300","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.35526+0300","logger":"etcd-client.client","caller":"v3@v3.5.13-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.40446+0300","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"11 MB","took":"now"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.404616+0300","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-demo1-cp-nt9h9-qxx28-1730912404"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: time="2024-11-06T20:00:04+03:00" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/etcd-snapshot-demo1-cp-nt9h9-qxx28-1730912404"
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: time="2024-11-06T20:00:04+03:00" level=info msg="Applying snapshot retention=5 to local snapshots with prefix etcd-snapshot in /var/lib/rancher/rke2/server/db/snapshots"
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: time="2024-11-06T20:00:04+03:00" level=info msg="Reconciling ETCDSnapshotFile resources"
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: time="2024-11-06T20:00:04+03:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
```
and I am using the default latest version, but since I hit this issue I thought it might be a version problem. I previously had a setup where I was able to bootstrap a successful cluster on vSphere, so I thought I should use the same version and see if that was the cause. So basically I installed both the latest and 2.9.0, and both give the same error.
c
is the cluster-agent deployment on that node running?
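(For reference, a direct way to check that; `k` is the kubectl alias used throughout this thread, and the namespace/deployment names are the ones that appear later in it:)
```
k -n cattle-system get deploy cattle-cluster-agent
```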
p
```
root@demo1-cp-nt9h9-qxx28:~# k get pod -A
NAMESPACE       NAME                                                   READY   STATUS      RESTARTS   AGE
cattle-system   cattle-cluster-agent-67bbb75c77-fkdjd                  0/1     Pending     0          16h
kube-system     cilium-5scwg                                           1/1     Running     0          16h
kube-system     cilium-operator-684f86fff5-m5954                       1/1     Running     0          16h
kube-system     cilium-operator-684f86fff5-vfs77                       0/1     Pending     0          16h
kube-system     etcd-demo1-cp-nt9h9-qxx28                              1/1     Running     0          16h
kube-system     helm-install-rancher-vsphere-cpi-4f9bc                 0/1     Completed   0          16h
kube-system     helm-install-rancher-vsphere-csi-cjfw2                 0/1     Completed   0          16h
kube-system     helm-install-rke2-cilium-qtk4c                         0/1     Completed   0          16h
kube-system     helm-install-rke2-coredns-l9fdt                        0/1     Completed   0          16h
kube-system     helm-install-rke2-ingress-nginx-9428h                  0/1     Pending     0          16h
kube-system     helm-install-rke2-metrics-server-6nhcx                 0/1     Pending     0          16h
kube-system     helm-install-rke2-snapshot-controller-crd-p9d2j        0/1     Pending     0          16h
kube-system     helm-install-rke2-snapshot-controller-dl9vs            0/1     Pending     0          16h
kube-system     helm-install-rke2-snapshot-validation-webhook-xvzn4    0/1     Pending     0          16h
kube-system     kube-apiserver-demo1-cp-nt9h9-qxx28                    1/1     Running     0          16h
kube-system     kube-controller-manager-demo1-cp-nt9h9-qxx28           1/1     Running     0          16h
kube-system     kube-proxy-demo1-cp-nt9h9-qxx28                        1/1     Running     0          16h
kube-system     kube-scheduler-demo1-cp-nt9h9-qxx28                    1/1     Running     0          16h
kube-system     rancher-vsphere-cpi-cloud-controller-manager-9r88v     1/1     Running     0          16h
kube-system     rke2-coredns-rke2-coredns-7d8f866c78-5dj5n             0/1     Pending     0          16h
kube-system     rke2-coredns-rke2-coredns-autoscaler-75bc99ff8-s8z2j   0/1     Pending     0          16h
kube-system     vsphere-csi-controller-7dc7858ffc-n9hzz                0/5     Pending     0          16h
kube-system     vsphere-csi-controller-7dc7858ffc-s2m7p                0/5     Pending     0          16h
kube-system     vsphere-csi-controller-7dc7858ffc-z9kkk                0/5     Pending     0          16h
root@demo1-cp-nt9h9-qxx28:~#
```
c
ok so everything is pending. Describe one of the pending pods to see why.
I suspect you have insufficient CPU resources?
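(A minimal sketch of that check, using the same `k` alias; the pod name is a placeholder:)
```
k get pods -A --field-selector=status.phase=Pending
k -n kube-system describe pod <pending-pod-name>
```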
p
because the other CP nodes and worker nodes are not available (not joined yet), so those pods will be pending
I said I have 3 CP nodes and 3 workers
I have only one CP node in Ready status that has joined the cluster
c
you showed the node list and there’s only one node there.
Where are the other ones?
Check the logs on the other 2 servers to see why they haven’t joined the cluster?
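(e.g., on each of the missing nodes; rancher-system-agent is the unit shown later in this thread, and rke2-server/rke2-agent are the standard RKE2 units for servers and workers:)
```
journalctl -u rancher-system-agent.service -f
journalctl -u rke2-server.service -f   # rke2-agent.service on worker nodes
```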
p
because Rancher could not install rke2-server/agent there!
so they look like standalone nodes now
c
… why not
why did it fail to install rke2 on the other 5 nodes?
p
```
root@demo1-cp-nt9h9-sv7th:~# journalctl -u rancher-system-agent.service
Nov 06 06:24:49 demo1-cp-nt9h9-sv7th systemd[1]: Started rancher-system-agent.service - Rancher System Agent.
Nov 06 06:24:49 demo1-cp-nt9h9-sv7th rancher-system-agent[1963]: time="2024-11-06T06:24:49+03:00" level=info msg="Rancher System Agent version v0.3.7 (bf4eb09) is starting"
Nov 06 06:24:49 demo1-cp-nt9h9-sv7th rancher-system-agent[1963]: time="2024-11-06T06:24:49+03:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Nov 06 06:24:49 demo1-cp-nt9h9-sv7th rancher-system-agent[1963]: time="2024-11-06T06:24:49+03:00" level=info msg="Starting remote watch of plans"
Nov 06 06:24:49 demo1-cp-nt9h9-sv7th rancher-system-agent[1963]: time="2024-11-06T06:24:49+03:00" level=info msg="Starting /v1, Kind=Secret controller"
root@demo1-cp-nt9h9-sv7th:~#
```
this one is one of the other CP nodes ☝️
c
what about the other one
p
the same, let me check one worker...
c
Did you create the cluster with 1 cp/etcd/worker and then scale up to 3 or something?
p
no, 3 CP and 3 workers
```
root@demo1-wk-d6rjr-42s5z:~# journalctl -u rancher-system-agent.service
Nov 06 06:20:03 demo1-wk-d6rjr-42s5z systemd[1]: Started rancher-system-agent.service - Rancher System Agent.
Nov 06 06:20:03 demo1-wk-d6rjr-42s5z rancher-system-agent[1959]: time="2024-11-06T06:20:03+03:00" level=info msg="Rancher System Agent version v0.3.7 (bf4eb09) is starting"
Nov 06 06:20:03 demo1-wk-d6rjr-42s5z rancher-system-agent[1959]: time="2024-11-06T06:20:03+03:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Nov 06 06:20:03 demo1-wk-d6rjr-42s5z rancher-system-agent[1959]: time="2024-11-06T06:20:03+03:00" level=info msg="Starting remote watch of plans"
Nov 06 06:20:03 demo1-wk-d6rjr-42s5z rancher-system-agent[1959]: time="2024-11-06T06:20:03+03:00" level=info msg="Starting /v1, Kind=Secret controller"
root@demo1-wk-d6rjr-42s5z:~#
```
c
yeah it appears that it’s waiting on the first one. But the first one won’t finish coming up until those pending pods can all run. I suspect you’ll need to give it more memory/cpu or whatever it wants, and then restart rke2.
What kind of cpu/memory does that first cp+etcd node have?
p
same for both CP and workers:
4 CPUs, 32 GB mem, 100 GB disk
c
Check to see why the pods are pending. Usually it's resources, but maybe not.
also check the vsphere cpi pod logs to see if there are errors
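(For reference; the CPI pod name is visible in the pod list above:)
```
k -n kube-system logs rancher-vsphere-cpi-cloud-controller-manager-9r88v --tail=100
```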
p
```
root@demo1-cp-nt9h9-qxx28:~# k -n kube-system describe pod/cilium-operator-684f86fff5-vfs77
Name:                 cilium-operator-684f86fff5-vfs77
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      cilium-operator
Node:                 <none>
Labels:               app.kubernetes.io/name=cilium-operator
                      app.kubernetes.io/part-of=cilium
                      io.cilium/app=operator
                      name=cilium-operator
                      pod-template-hash=684f86fff5
Annotations:          prometheus.io/port: 9963
                      prometheus.io/scrape: true
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        ReplicaSet/cilium-operator-684f86fff5
Containers:
  cilium-operator:
    Image:      rancher/mirrored-cilium-operator-generic:v1.16.1
    Port:       9963/TCP
    Host Port:  9963/TCP
    Command:
      cilium-operator-generic
    Args:
      --config-dir=/tmp/cilium/config-map
      --debug=$(CILIUM_DEBUG)
    Liveness:   http-get http://127.0.0.1:9234/healthz delay=60s timeout=3s period=10s #success=1 #failure=3
    Readiness:  http-get http://127.0.0.1:9234/healthz delay=0s timeout=3s period=5s #success=1 #failure=5
    Environment:
      K8S_NODE_NAME:          (v1:spec.nodeName)
      CILIUM_K8S_NAMESPACE:  kube-system (v1:metadata.namespace)
      CILIUM_DEBUG:          <set to the key 'debug' of config map 'cilium-config'>  Optional: true
    Mounts:
      /tmp/cilium/config-map from cilium-config-path (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-drtsh (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  cilium-config-path:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cilium-config
    Optional:  false
  kube-api-access-drtsh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  19m (x199 over 16h)  default-scheduler  0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports.
root@demo1-cp-nt9h9-qxx28:~#
```
c
not the cilium operator pod. that one is fine, there’s already a replica of that one running
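(Why it's fine: the operator deployment runs two replicas that both bind hostPort 9963, as shown in the describe output above, so on a one-node cluster the second replica can never schedule. A quick confirmation:)
```
k -n kube-system get deploy cilium-operator
```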
p
```
root@demo1-cp-nt9h9-qxx28:~# k -n kube-system describe pod/vsphere-csi-controller-7dc7858ffc-z9kkk
Name:             vsphere-csi-controller-7dc7858ffc-z9kkk
Namespace:        kube-system
Priority:         0
Service Account:  vsphere-csi-controller
Node:             <none>
Labels:           app=vsphere-csi-controller
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/version=3.3.0-rancher1
                  helm.sh/chart=rancher-vsphere-csi-3.3.0-rancher100
                  pod-template-hash=7dc7858ffc
                  role=vsphere-csi
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/vsphere-csi-controller-7dc7858ffc
Containers:
  csi-attacher:
    Image:      rancher/mirrored-sig-storage-csi-attacher:v4.5.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --timeout=300s
      --csi-address=$(ADDRESS)
      --leader-election
      --kube-api-qps=100
      --kube-api-burst=100
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hrzq (ro)
  vsphere-csi-controller:
    Image:       rancher/mirrored-cloud-provider-vsphere-csi-release-driver:v3.3.0
    Ports:       9808/TCP, 2112/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --fss-name=internal-feature-states.csi.vsphere.vmware.com
      --fss-namespace=$(CSI_NAMESPACE)
    Liveness:  http-get http://:healthz/healthz delay=10s timeout=3s period=5s #success=1 #failure=3
    Environment:
      CSI_ENDPOINT:                     unix:///csi/csi.sock
      X_CSI_MODE:                       controller
      X_CSI_SPEC_DISABLE_LEN_CHECK:     true
      X_CSI_SERIAL_VOL_ACCESS_TIMEOUT:  3m
      VSPHERE_CSI_CONFIG:               /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:                     PRODUCTION
      INCLUSTER_CLIENT_QPS:             100
      INCLUSTER_CLIENT_BURST:           100
      CSI_NAMESPACE:                    kube-system (v1:metadata.namespace)
    Mounts:
      /csi from socket-dir (rw)
      /etc/cloud from vsphere-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hrzq (ro)
  liveness-probe:
    Image:      rancher/mirrored-sig-storage-livenessprobe:v2.12.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --csi-address=/csi/csi.sock
    Environment:  <none>
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hrzq (ro)
  vsphere-syncer:
    Image:      rancher/mirrored-cloud-provider-vsphere-csi-release-syncer:v3.3.0
    Port:       2113/TCP
    Host Port:  0/TCP
    Args:
      --leader-election
      --fss-name=internal-feature-states.csi.vsphere.vmware.com
      --fss-namespace=$(CSI_NAMESPACE)
    Environment:
      FULL_SYNC_INTERVAL_MINUTES:  30
      VSPHERE_CSI_CONFIG:          /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:                PRODUCTION
      INCLUSTER_CLIENT_QPS:        100
      INCLUSTER_CLIENT_BURST:      100
      CSI_NAMESPACE:               kube-system (v1:metadata.namespace)
    Mounts:
      /etc/cloud from vsphere-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hrzq (ro)
  csi-provisioner:
    Image:      rancher/mirrored-sig-storage-csi-provisioner:v4.0.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --timeout=300s
      --csi-address=$(ADDRESS)
      --kube-api-qps=100
      --kube-api-burst=100
      --leader-election
      --default-fstype=ext4
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hrzq (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  vsphere-config-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  vsphere-config-secret
    Optional:    false
  socket-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-8hrzq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/controlplane=true:NoSchedule
                             node-role.kubernetes.io/etcd:NoExecute op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  22m (x197 over 16h)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
root@demo1-cp-nt9h9-qxx28:~#
```
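(The uninitialized taint the scheduler is complaining about can also be checked directly on the node, using the node name from this cluster:)
```
k get node demo1-cp-nt9h9-qxx28 -o jsonpath='{.spec.taints}'
```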
c
check the cluster-agent and coredns pods. Also the vsphere cloud controller pod logs. I suspect that one will have errors in it, which is why the node is still tainted uninitialized.
yep. vsphere isn’t configured right. check the vsphere cpi pod logs
p
```
root@demo1-cp-nt9h9-qxx28:~# k -n cattle-system describe pod/cattle-cluster-agent-67bbb75c77-fkdjd
Name:             cattle-cluster-agent-67bbb75c77-fkdjd
Namespace:        cattle-system
Priority:         0
Service Account:  cattle
Node:             <none>
Labels:           app=cattle-cluster-agent
                  pod-template-hash=67bbb75c77
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/cattle-cluster-agent-67bbb75c77
Containers:
  cluster-register:
    Image:      rancher/rancher-agent:v2.9.0
    Port:       <none>
    Host Port:  <none>
    Environment:
      CATTLE_FEATURES:           embedded-cluster-api=false,fleet=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false,ui-sql-cache=false
      CATTLE_IS_RKE:             false
      CATTLE_SERVER:             https://172.16.9.11.sslip.io
      CATTLE_CA_CHECKSUM:        a14e4b6838604b9aa087b33958dee0a1fafd944c1d024114fb1eab4d89748a2a
      CATTLE_CLUSTER:            true
      CATTLE_K8S_MANAGED:        true
      CATTLE_CLUSTER_REGISTRY:
      CATTLE_SERVER_VERSION:     v2.9.0
      CATTLE_INSTALL_UUID:       f2801916-7c2e-45dc-a119-effce68a3ed5
      CATTLE_INGRESS_IP_DOMAIN:  sslip.io
      STRICT_VERIFY:             false
    Mounts:
      /cattle-credentials from cattle-credentials (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9dq92 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  cattle-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cattle-credentials-d7916af
    Optional:    false
  kube-api-access-9dq92:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/etcd:NoExecute
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  23m (x197 over 16h)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
root@demo1-cp-nt9h9-qxx28:~#
```
c
yep they’re all pending because the vsphere cpi hasn’t untainted the node yet
p
so what should I do in this situation?
c
continue working to figure out what's wrong with your vSphere stuff?
p
but why does it work when I deploy the cluster on vSphere with exactly the same config as the current one, and all is good?! Now I have two Rancher 2.9.0 servers, same version: one can deploy a cluster on vSphere with no issue, the other one can't!
c
why don’t you check the logs and find out?
p
right, I checked the logs and couldn't figure it out, that's why I posted here 😞
c
what’s in the log on the
rancher-vsphere-cpi-cloud-controller-manager-9r88v
pod
you have not showed anything about that yet
p
```
E1106 20:58:39.699338       1 node_controller.go:240] error syncing 'demo1-cp-nt9h9-qxx28': failed to get instance metadata for node demo1-cp-nt9h9-qxx28: failed to get zone from cloud provider: Zone: Error fetching by providerID: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176 Error fetching by NodeName: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176, requeuing
I1106 21:00:00.851780       1 node_controller.go:425] Initializing node demo1-cp-nt9h9-qxx28 with cloud provider
I1106 21:00:00.851831       1 instances.go:102] instances.InstanceID() CACHED with demo1-cp-nt9h9-qxx28
I1106 21:00:00.851849       1 search.go:76] WhichVCandDCByNodeID nodeID: 42291d12-2fc2-de85-20e6-ffb494ee6e4f
I1106 21:00:00.873631       1 search.go:208] Found node 42291d12-2fc2-de85-20e6-ffb494ee6e4f as vm=VirtualMachine:vm-1176 in vc=192.168.200.200 and datacenter=My_Datacenter
I1106 21:00:00.873668       1 search.go:210] Hostname: demo1-cp-nt9h9-qxx28, UUID: 42291d12-2fc2-de85-20e6-ffb494ee6e4f
I1106 21:00:00.873696       1 nodemanager.go:168] Discovered VM using normal UUID format
I1106 21:00:00.883218       1 nodemanager.go:277] Adding Hostname: demo1-cp-nt9h9-qxx28
I1106 21:00:00.883267       1 nodemanager.go:449] Adding Internal IP: 172.16.9.65
I1106 21:00:00.883278       1 nodemanager.go:454] Adding External IP: 172.16.9.65
I1106 21:00:00.883299       1 nodemanager.go:358] Found node 42291d12-2fc2-de85-20e6-ffb494ee6e4f as vm=VirtualMachine:vm-1176 in vc=192.168.200.200 and datacenter=My_Datacenter
I1106 21:00:00.883320       1 nodemanager.go:360] Hostname: demo1-cp-nt9h9-qxx28 UUID: 42291d12-2fc2-de85-20e6-ffb494ee6e4f
I1106 21:00:00.883336       1 instances.go:77] instances.NodeAddressesByProviderID() FOUND with 42291d12-2fc2-de85-20e6-ffb494ee6e4f
E1106 21:00:01.371224       1 zones.go:394] Get zone for mo: HostSystem:host-1012: vSphere region category k8s-region does not match any tags for mo: HostSystem:host-1012
E1106 21:00:01.838151       1 zones.go:394] Get zone for mo: ResourcePool:resgroup-1007: vSphere region category k8s-region does not match any tags for mo: ResourcePool:resgroup-1007
E1106 21:00:02.664192       1 zones.go:394] Get zone for mo: VirtualMachine:vm-1176: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176
E1106 21:00:02.664242       1 zones.go:224] Failed to get host system properties. err: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176
E1106 21:00:03.484210       1 zones.go:394] Get zone for mo: HostSystem:host-1012: vSphere region category k8s-region does not match any tags for mo: HostSystem:host-1012
E1106 21:00:04.081031       1 zones.go:394] Get zone for mo: ResourcePool:resgroup-1007: vSphere region category k8s-region does not match any tags for mo: ResourcePool:resgroup-1007
E1106 21:00:04.662106       1 zones.go:394] Get zone for mo: VirtualMachine:vm-1176: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176
E1106 21:00:04.662130       1 zones.go:153] Failed to get host system properties. err: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176
I1106 21:00:04.662177       1 node_controller.go:229] error syncing 'demo1-cp-nt9h9-qxx28': failed to get instance metadata for node demo1-cp-nt9h9-qxx28: failed to get zone from cloud provider: Zone: Error fetching by providerID: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176 Error fetching by NodeName: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176, requeuing
E1106 21:00:04.662200       1 node_controller.go:240] error syncing 'demo1-cp-nt9h9-qxx28': failed to get instance metadata for node demo1-cp-nt9h9-qxx28: failed to get zone from cloud provider: Zone: Error fetching by providerID: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176 Error fetching by NodeName: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176, requeuing
[… the same Initializing / zones.go errors / node_controller "error syncing" sequence repeats every couple of minutes …]
```
c
There you go. Fix that.
p
right, so should I disable this in the UI?
c
idk, it depends on how you’ve set up vsphere. If you’re not going to tag your VM hosts then you probably shouldn’t tell vsphere to expect region and zone tags.
p
right, it seems that was the issue, now I can see nodes are joining the cluster...
c
yep! that was why I was suggesting you check the pod logs!
So this was probably something you were doing differently across the two Rancher servers? Nothing to do with the version or anything else?
p
most likely
c
This is all vsphere stuff. It is documented in their project: https://cloud-provider-vsphere.sigs.k8s.io/tutorials/deploying_cpi_with_multi_dc_vc_aka_zones
See the "Creating Zones in your vSphere Environment via Tags" section
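(For anyone hitting the same zone-tag error: the tag categories and tags the CPI expects can be created with govc, roughly as sketched below. The category names must match the region/zone categories in your CPI config; k8s-region appears in the error above, k8s-zone is its usual counterpart, and the inventory paths are placeholders apart from the datacenter name from these logs:)
```
# create the tag categories the CPI looks for
govc tags.category.create k8s-region
govc tags.category.create k8s-zone
# create one tag per region/zone and attach them to the inventory objects
govc tags.create -c k8s-region region-1
govc tags.create -c k8s-zone zone-a
govc tags.attach -c k8s-region region-1 /My_Datacenter
govc tags.attach -c k8s-zone zone-a /My_Datacenter/host/<your-cluster>
```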
p
amazing, you are right, thank you Brandon for your help 🙏 You saved my day, appreciated
they are all in Ready status, I will try the latest version of Rancher and give it a try
Thanks again!
Hi Brandon, quick question please: I deleted this cluster and created a new one without enabling the vSphere tags (Region and Zone), and again I have the same issue:
no logs...
c
do the same thing man
look at the pending pods to see why they’re pending, look at the vsphere cpi pod logs to see what you did wrong if the pending pods are pending because the node is tainted uninitialized
p
even if it is running? Because this time the CPI is running; I should still check the logs, right?
c
the cpi was running last time, wasn’t it? It just had errors in its log.
why would you not check the logs? What's the downside?
p
you are absolutely right, but since the CPI is running and the others are in Pending state, there are no logs for the pending ones. Alright, I will check the CPI logs
alright, I found the issue. Actually it is a Rancher bug: when you create a cluster using the Rancher UI and choose the cloud provider "vsphere", on the left side there are the "Add-on: vSphere CPI" and "Add-on: vSphere CSI" sections, where you need to provide the vSphere data, including credentials. When you add the credentials, even if you enable the "Generate Credential's Secret" option, Rancher will not generate the secret for either CPI or CSI, and this causes problems for both, so the cluster will not start properly. I tried it twice, and it is true that Rancher does not generate the secret for CPI and CSI. So you should go back to the UI, "Edit Config" for the cluster, go to the "Add-on: vSphere CPI" and "Add-on: vSphere CSI" sections, re-enter the credentials, and make sure the "Generate Credential's Secret" checkbox is on. After that the remaining pending pods will be initialized and running, and the pending nodes will join the cluster.
Probably I should create an issue on GitHub for this.
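(A quick sanity check that the credential secrets actually got created; vsphere-config-secret is the CSI secret name visible in the describe output earlier in the thread, and the exact name of the CPI credential secret depends on the chart, hence the grep:)
```
k -n kube-system get secrets | grep -i vsphere
k -n kube-system get secret vsphere-config-secret
```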