# general
c
1. Your message is unclear. These logs appear to be from a cluster node that Rancher has provisioned, not the cluster that Rancher is running on? What exactly is failing here?
2. Rancher 2.9.0 is from July. Is there any reason you're not using 2.9.3?
p
okay, the logs belong to the Rancher UI (the Rancher server itself), not the cluster being provisioned (incomplete) by Rancher. I just logged in again and here are the latest logs for the Rancher UI:
```
2024/11/06 19:29:01 [ERROR] error syncing '_all_': handler user-controllers-controller: userControllersController: failed to set peers for key _all_: failed to start user controllers for cluster c-m-ssbpz56l: ClusterUnavailable 503: cluster not found, requeuing
2024/11/06 19:30:34 [ERROR] error syncing 'c-m-ssbpz56l': handler cluster-deploy: cluster context c-m-ssbpz56l is unavailable, requeuing
2024/11/06 19:30:34 [INFO] [planner] rkecluster fleet-default/demo1: configuring bootstrap node(s) demo1-cp-nt9h9-qxx28: waiting for cluster agent to connect
2024/11/06 19:31:01 [ERROR] error syncing '_all_': handler user-controllers-controller: userControllersController: failed to set peers for key _all_: failed to start user controllers for cluster c-m-ssbpz56l: ClusterUnavailable 503: cluster not found, requeuing
2024/11/06 19:32:34 [ERROR] error syncing 'c-m-ssbpz56l': handler cluster-deploy: cluster context c-m-ssbpz56l is unavailable, requeuing
2024/11/06 19:32:34 [INFO] [planner] rkecluster fleet-default/demo1: configuring bootstrap node(s) demo1-cp-nt9h9-qxx28: waiting for cluster agent to connect
[… the same ERROR "error syncing" / INFO "waiting for cluster agent to connect" pair repeats every two minutes through 19:56 …]
2024/11/06 19:50:49 [INFO] Purged 1 expired tokens
```
and here is one of the 3 CP nodes:
```
root@demo1-cp-nt9h9-qxx28:~# k get node
NAME                   STATUS   ROLES                       AGE   VERSION
demo1-cp-nt9h9-qxx28   Ready    control-plane,etcd,master   16h   v1.30.5+rke2r1
```
```
root@demo1-cp-nt9h9-qxx28:~# journalctl -u rke2-server.service -f
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.231606+0300","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-demo1-cp-nt9h9-qxx28-1730912404.part"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.235744+0300","logger":"etcd-client.client","caller":"v3@v3.5.13-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.235802+0300","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.35526+0300","logger":"etcd-client.client","caller":"v3@v3.5.13-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.40446+0300","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"11 MB","took":"now"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: {"level":"info","ts":"2024-11-06T20:00:04.404616+0300","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-demo1-cp-nt9h9-qxx28-1730912404"}
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: time="2024-11-06T20:00:04+03:00" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/etcd-snapshot-demo1-cp-nt9h9-qxx28-1730912404"
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: time="2024-11-06T20:00:04+03:00" level=info msg="Applying snapshot retention=5 to local snapshots with prefix etcd-snapshot in /var/lib/rancher/rke2/server/db/snapshots"
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: time="2024-11-06T20:00:04+03:00" level=info msg="Reconciling ETCDSnapshotFile resources"
Nov 06 20:00:04 demo1-cp-nt9h9-qxx28 rke2[2175]: time="2024-11-06T20:00:04+03:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
```
and I am using the default latest version, but since I hit this issue I thought it might be a version problem. I previously had a setup where I was able to bootstrap a successful cluster on vSphere, so I thought I should use the same version and see if that was the cause. So basically I installed both the latest and 2.9.0, and both give the same error.
c
is the cluster-agent deployment on that node running?
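(For reference, a direct way to check that; `k` is the kubectl alias used throughout this thread, and the namespace/deployment names are the ones that appear later in it:)
```
k -n cattle-system get deploy cattle-cluster-agent
```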
p
```
root@demo1-cp-nt9h9-qxx28:~# k get pod -A
NAMESPACE       NAME                                                   READY   STATUS      RESTARTS   AGE
cattle-system   cattle-cluster-agent-67bbb75c77-fkdjd                  0/1     Pending     0          16h
kube-system     cilium-5scwg                                           1/1     Running     0          16h
kube-system     cilium-operator-684f86fff5-m5954                       1/1     Running     0          16h
kube-system     cilium-operator-684f86fff5-vfs77                       0/1     Pending     0          16h
kube-system     etcd-demo1-cp-nt9h9-qxx28                              1/1     Running     0          16h
kube-system     helm-install-rancher-vsphere-cpi-4f9bc                 0/1     Completed   0          16h
kube-system     helm-install-rancher-vsphere-csi-cjfw2                 0/1     Completed   0          16h
kube-system     helm-install-rke2-cilium-qtk4c                         0/1     Completed   0          16h
kube-system     helm-install-rke2-coredns-l9fdt                        0/1     Completed   0          16h
kube-system     helm-install-rke2-ingress-nginx-9428h                  0/1     Pending     0          16h
kube-system     helm-install-rke2-metrics-server-6nhcx                 0/1     Pending     0          16h
kube-system     helm-install-rke2-snapshot-controller-crd-p9d2j        0/1     Pending     0          16h
kube-system     helm-install-rke2-snapshot-controller-dl9vs            0/1     Pending     0          16h
kube-system     helm-install-rke2-snapshot-validation-webhook-xvzn4    0/1     Pending     0          16h
kube-system     kube-apiserver-demo1-cp-nt9h9-qxx28                    1/1     Running     0          16h
kube-system     kube-controller-manager-demo1-cp-nt9h9-qxx28           1/1     Running     0          16h
kube-system     kube-proxy-demo1-cp-nt9h9-qxx28                        1/1     Running     0          16h
kube-system     kube-scheduler-demo1-cp-nt9h9-qxx28                    1/1     Running     0          16h
kube-system     rancher-vsphere-cpi-cloud-controller-manager-9r88v     1/1     Running     0          16h
kube-system     rke2-coredns-rke2-coredns-7d8f866c78-5dj5n             0/1     Pending     0          16h
kube-system     rke2-coredns-rke2-coredns-autoscaler-75bc99ff8-s8z2j   0/1     Pending     0          16h
kube-system     vsphere-csi-controller-7dc7858ffc-n9hzz                0/5     Pending     0          16h
kube-system     vsphere-csi-controller-7dc7858ffc-s2m7p                0/5     Pending     0          16h
kube-system     vsphere-csi-controller-7dc7858ffc-z9kkk                0/5     Pending     0          16h
root@demo1-cp-nt9h9-qxx28:~#
```
c
ok so everything is pending. Describe one of the pending pods to see why.
I suspect you have insufficient CPU resources?
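(A minimal sketch of that check, using the same `k` alias; the pod name is a placeholder:)
```
k get pods -A --field-selector=status.phase=Pending
k -n kube-system describe pod <pending-pod-name>
```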
p
because the other CP nodes and worker nodes are not available (not joined yet), so those pods will be pending
I said I have 3 CP nodes and 3 workers
I have only one CP node in Ready status that has joined the cluster
c
you showed the node list and there’s only one node there.
Where are the other ones?
Check the logs on the other 2 servers to see why they haven’t joined the cluster?
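(e.g., on each of the missing nodes; rancher-system-agent is the unit shown later in this thread, and rke2-server/rke2-agent are the standard RKE2 units for servers and workers:)
```
journalctl -u rancher-system-agent.service -f
journalctl -u rke2-server.service -f   # rke2-agent.service on worker nodes
```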
p
because Rancher could not install rke2-server/agent there!
so they look like standalone nodes now
c
… why not
why did it fail to install rke2 on the other 5 nodes?
p
```
root@demo1-cp-nt9h9-sv7th:~# journalctl -u rancher-system-agent.service
Nov 06 06:24:49 demo1-cp-nt9h9-sv7th systemd[1]: Started rancher-system-agent.service - Rancher System Agent.
Nov 06 06:24:49 demo1-cp-nt9h9-sv7th rancher-system-agent[1963]: time="2024-11-06T06:24:49+03:00" level=info msg="Rancher System Agent version v0.3.7 (bf4eb09) is starting"
Nov 06 06:24:49 demo1-cp-nt9h9-sv7th rancher-system-agent[1963]: time="2024-11-06T06:24:49+03:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Nov 06 06:24:49 demo1-cp-nt9h9-sv7th rancher-system-agent[1963]: time="2024-11-06T06:24:49+03:00" level=info msg="Starting remote watch of plans"
Nov 06 06:24:49 demo1-cp-nt9h9-sv7th rancher-system-agent[1963]: time="2024-11-06T06:24:49+03:00" level=info msg="Starting /v1, Kind=Secret controller"
root@demo1-cp-nt9h9-sv7th:~#
```
this one is one of the other CP nodes ☝️
c
what about the other one
p
the same, let me check one worker...
c
Did you create the cluster with 1 cp/etcd/worker and then scale up to 3 or something?
p
no, 3 CP and 3 workers
```
root@demo1-wk-d6rjr-42s5z:~# journalctl -u rancher-system-agent.service
Nov 06 06:20:03 demo1-wk-d6rjr-42s5z systemd[1]: Started rancher-system-agent.service - Rancher System Agent.
Nov 06 06:20:03 demo1-wk-d6rjr-42s5z rancher-system-agent[1959]: time="2024-11-06T06:20:03+03:00" level=info msg="Rancher System Agent version v0.3.7 (bf4eb09) is starting"
Nov 06 06:20:03 demo1-wk-d6rjr-42s5z rancher-system-agent[1959]: time="2024-11-06T06:20:03+03:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Nov 06 06:20:03 demo1-wk-d6rjr-42s5z rancher-system-agent[1959]: time="2024-11-06T06:20:03+03:00" level=info msg="Starting remote watch of plans"
Nov 06 06:20:03 demo1-wk-d6rjr-42s5z rancher-system-agent[1959]: time="2024-11-06T06:20:03+03:00" level=info msg="Starting /v1, Kind=Secret controller"
root@demo1-wk-d6rjr-42s5z:~#
```
c
yeah it appears that it’s waiting on the first one. But the first one won’t finish coming up until those pending pods can all run. I suspect you’ll need to give it more memory/cpu or whatever it wants, and then restart rke2.
What kind of cpu/memory does that first cp+etcd node have?
p
same for both CP and workers:
4 CPUs, 32 GB mem, 100 GB disk
c
Check to see why the pods are pending. Usually it's resources, but maybe not.
also check the vsphere cpi pod logs to see if there are errors
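(For reference; the CPI pod name is visible in the pod list above:)
```
k -n kube-system logs rancher-vsphere-cpi-cloud-controller-manager-9r88v --tail=100
```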
p
```
root@demo1-cp-nt9h9-qxx28:~# k -n kube-system describe pod/cilium-operator-684f86fff5-vfs77
Name:                 cilium-operator-684f86fff5-vfs77
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      cilium-operator
Node:                 <none>
Labels:               app.kubernetes.io/name=cilium-operator
                      app.kubernetes.io/part-of=cilium
                      io.cilium/app=operator
                      name=cilium-operator
                      pod-template-hash=684f86fff5
Annotations:          prometheus.io/port: 9963
                      prometheus.io/scrape: true
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        ReplicaSet/cilium-operator-684f86fff5
Containers:
  cilium-operator:
    Image:      rancher/mirrored-cilium-operator-generic:v1.16.1
    Port:       9963/TCP
    Host Port:  9963/TCP
    Command:
      cilium-operator-generic
    Args:
      --config-dir=/tmp/cilium/config-map
      --debug=$(CILIUM_DEBUG)
    Liveness:   http-get http://127.0.0.1:9234/healthz delay=60s timeout=3s period=10s #success=1 #failure=3
    Readiness:  http-get http://127.0.0.1:9234/healthz delay=0s timeout=3s period=5s #success=1 #failure=5
    Environment:
      K8S_NODE_NAME:          (v1:spec.nodeName)
      CILIUM_K8S_NAMESPACE:  kube-system (v1:metadata.namespace)
      CILIUM_DEBUG:          <set to the key 'debug' of config map 'cilium-config'>  Optional: true
    Mounts:
      /tmp/cilium/config-map from cilium-config-path (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-drtsh (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  cilium-config-path:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cilium-config
    Optional:  false
  kube-api-access-drtsh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  19m (x199 over 16h)  default-scheduler  0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports.
root@demo1-cp-nt9h9-qxx28:~#
```
c
not the cilium operator pod. that one is fine, there’s already a replica of that one running
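(Why it's fine: the operator deployment runs two replicas that both bind hostPort 9963, as shown in the describe output above, so on a one-node cluster the second replica can never schedule. A quick confirmation:)
```
k -n kube-system get deploy cilium-operator
```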
p
```
root@demo1-cp-nt9h9-qxx28:~# k -n kube-system describe pod/vsphere-csi-controller-7dc7858ffc-z9kkk
Name:             vsphere-csi-controller-7dc7858ffc-z9kkk
Namespace:        kube-system
Priority:         0
Service Account:  vsphere-csi-controller
Node:             <none>
Labels:           app=vsphere-csi-controller
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/version=3.3.0-rancher1
                  helm.sh/chart=rancher-vsphere-csi-3.3.0-rancher100
                  pod-template-hash=7dc7858ffc
                  role=vsphere-csi
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/vsphere-csi-controller-7dc7858ffc
Containers:
  csi-attacher:
    Image:      rancher/mirrored-sig-storage-csi-attacher:v4.5.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --timeout=300s
      --csi-address=$(ADDRESS)
      --leader-election
      --kube-api-qps=100
      --kube-api-burst=100
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hrzq (ro)
  vsphere-csi-controller:
    Image:       rancher/mirrored-cloud-provider-vsphere-csi-release-driver:v3.3.0
    Ports:       9808/TCP, 2112/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --fss-name=internal-feature-states.csi.vsphere.vmware.com
      --fss-namespace=$(CSI_NAMESPACE)
    Liveness:  http-get http://:healthz/healthz delay=10s timeout=3s period=5s #success=1 #failure=3
    Environment:
      CSI_ENDPOINT:                     unix:///csi/csi.sock
      X_CSI_MODE:                       controller
      X_CSI_SPEC_DISABLE_LEN_CHECK:     true
      X_CSI_SERIAL_VOL_ACCESS_TIMEOUT:  3m
      VSPHERE_CSI_CONFIG:               /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:                     PRODUCTION
      INCLUSTER_CLIENT_QPS:             100
      INCLUSTER_CLIENT_BURST:           100
      CSI_NAMESPACE:                    kube-system (v1:metadata.namespace)
    Mounts:
      /csi from socket-dir (rw)
      /etc/cloud from vsphere-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hrzq (ro)
  liveness-probe:
    Image:      rancher/mirrored-sig-storage-livenessprobe:v2.12.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --csi-address=/csi/csi.sock
    Environment:  <none>
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hrzq (ro)
  vsphere-syncer:
    Image:      rancher/mirrored-cloud-provider-vsphere-csi-release-syncer:v3.3.0
    Port:       2113/TCP
    Host Port:  0/TCP
    Args:
      --leader-election
      --fss-name=internal-feature-states.csi.vsphere.vmware.com
      --fss-namespace=$(CSI_NAMESPACE)
    Environment:
      FULL_SYNC_INTERVAL_MINUTES:  30
      VSPHERE_CSI_CONFIG:          /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:                PRODUCTION
      INCLUSTER_CLIENT_QPS:        100
      INCLUSTER_CLIENT_BURST:      100
      CSI_NAMESPACE:               kube-system (v1:metadata.namespace)
    Mounts:
      /etc/cloud from vsphere-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hrzq (ro)
  csi-provisioner:
    Image:      rancher/mirrored-sig-storage-csi-provisioner:v4.0.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --timeout=300s
      --csi-address=$(ADDRESS)
      --kube-api-qps=100
      --kube-api-burst=100
      --leader-election
      --default-fstype=ext4
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8hrzq (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  vsphere-config-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  vsphere-config-secret
    Optional:    false
  socket-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-8hrzq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/controlplane=true:NoSchedule
                             node-role.kubernetes.io/etcd:NoExecute op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  22m (x197 over 16h)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
root@demo1-cp-nt9h9-qxx28:~#
```
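(The uninitialized taint the scheduler is complaining about can also be checked directly on the node, using the node name from this cluster:)
```
k get node demo1-cp-nt9h9-qxx28 -o jsonpath='{.spec.taints}'
```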
c
check the cluster-agent and coredns pods. Also the vsphere cloud controller pod logs. I suspect that one will have errors in it, which is why the node is still tainted uninitialized.
yep. vsphere isn’t configured right. check the vsphere cpi pod logs
p
```
root@demo1-cp-nt9h9-qxx28:~# k -n cattle-system describe pod/cattle-cluster-agent-67bbb75c77-fkdjd
Name:             cattle-cluster-agent-67bbb75c77-fkdjd
Namespace:        cattle-system
Priority:         0
Service Account:  cattle
Node:             <none>
Labels:           app=cattle-cluster-agent
                  pod-template-hash=67bbb75c77
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/cattle-cluster-agent-67bbb75c77
Containers:
  cluster-register:
    Image:      rancher/rancher-agent:v2.9.0
    Port:       <none>
    Host Port:  <none>
    Environment:
      CATTLE_FEATURES:           embedded-cluster-api=false,fleet=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false,ui-sql-cache=false
      CATTLE_IS_RKE:             false
      CATTLE_SERVER:             https://172.16.9.11.sslip.io
      CATTLE_CA_CHECKSUM:        a14e4b6838604b9aa087b33958dee0a1fafd944c1d024114fb1eab4d89748a2a
      CATTLE_CLUSTER:            true
      CATTLE_K8S_MANAGED:        true
      CATTLE_CLUSTER_REGISTRY:
      CATTLE_SERVER_VERSION:     v2.9.0
      CATTLE_INSTALL_UUID:       f2801916-7c2e-45dc-a119-effce68a3ed5
      CATTLE_INGRESS_IP_DOMAIN:  sslip.io
      STRICT_VERIFY:             false
    Mounts:
      /cattle-credentials from cattle-credentials (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9dq92 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  cattle-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cattle-credentials-d7916af
    Optional:    false
  kube-api-access-9dq92:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/etcd:NoExecute
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  23m (x197 over 16h)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
root@demo1-cp-nt9h9-qxx28:~#
```
c
yep they’re all pending because the vsphere cpi hasn’t untainted the node yet
p
so what should I do in this situation?
c
continue working to figure out what's wrong with your vSphere stuff?
p
but why does it work when I deploy the cluster on vSphere with exactly the same config as the current one, and all is good?! Now I have two Rancher 2.9.0 servers, same version: one can deploy a cluster on vSphere with no issue, the other one can't!
c
why don’t you check the logs and find out?
p
right, I checked the logs and couldn't figure it out, that's why I posted here 😞
c
what’s in the log on the
rancher-vsphere-cpi-cloud-controller-manager-9r88v
pod
you have not showed anything about that yet
p
```
E1106 20:58:39.699338       1 node_controller.go:240] error syncing 'demo1-cp-nt9h9-qxx28': failed to get instance metadata for node demo1-cp-nt9h9-qxx28: failed to get zone from cloud provider: Zone: Error fetching by providerID: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176 Error fetching by NodeName: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176, requeuing
I1106 21:00:00.851780       1 node_controller.go:425] Initializing node demo1-cp-nt9h9-qxx28 with cloud provider
I1106 21:00:00.851831       1 instances.go:102] instances.InstanceID() CACHED with demo1-cp-nt9h9-qxx28
I1106 21:00:00.851849       1 search.go:76] WhichVCandDCByNodeID nodeID: 42291d12-2fc2-de85-20e6-ffb494ee6e4f
I1106 21:00:00.873631       1 search.go:208] Found node 42291d12-2fc2-de85-20e6-ffb494ee6e4f as vm=VirtualMachine:vm-1176 in vc=192.168.200.200 and datacenter=My_Datacenter
I1106 21:00:00.873668       1 search.go:210] Hostname: demo1-cp-nt9h9-qxx28, UUID: 42291d12-2fc2-de85-20e6-ffb494ee6e4f
I1106 21:00:00.873696       1 nodemanager.go:168] Discovered VM using normal UUID format
I1106 21:00:00.883218       1 nodemanager.go:277] Adding Hostname: demo1-cp-nt9h9-qxx28
I1106 21:00:00.883267       1 nodemanager.go:449] Adding Internal IP: 172.16.9.65
I1106 21:00:00.883278       1 nodemanager.go:454] Adding External IP: 172.16.9.65
I1106 21:00:00.883299       1 nodemanager.go:358] Found node 42291d12-2fc2-de85-20e6-ffb494ee6e4f as vm=VirtualMachine:vm-1176 in vc=192.168.200.200 and datacenter=My_Datacenter
I1106 21:00:00.883320       1 nodemanager.go:360] Hostname: demo1-cp-nt9h9-qxx28 UUID: 42291d12-2fc2-de85-20e6-ffb494ee6e4f
I1106 21:00:00.883336       1 instances.go:77] instances.NodeAddressesByProviderID() FOUND with 42291d12-2fc2-de85-20e6-ffb494ee6e4f
E1106 21:00:01.371224       1 zones.go:394] Get zone for mo: HostSystem:host-1012: vSphere region category k8s-region does not match any tags for mo: HostSystem:host-1012
E1106 21:00:01.838151       1 zones.go:394] Get zone for mo: ResourcePool:resgroup-1007: vSphere region category k8s-region does not match any tags for mo: ResourcePool:resgroup-1007
E1106 21:00:02.664192       1 zones.go:394] Get zone for mo: VirtualMachine:vm-1176: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176
E1106 21:00:02.664242       1 zones.go:224] Failed to get host system properties. err: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176
E1106 21:00:03.484210       1 zones.go:394] Get zone for mo: HostSystem:host-1012: vSphere region category k8s-region does not match any tags for mo: HostSystem:host-1012
E1106 21:00:04.081031       1 zones.go:394] Get zone for mo: ResourcePool:resgroup-1007: vSphere region category k8s-region does not match any tags for mo: ResourcePool:resgroup-1007
E1106 21:00:04.662106       1 zones.go:394] Get zone for mo: VirtualMachine:vm-1176: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176
E1106 21:00:04.662130       1 zones.go:153] Failed to get host system properties. err: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176
I1106 21:00:04.662177       1 node_controller.go:229] error syncing 'demo1-cp-nt9h9-qxx28': failed to get instance metadata for node demo1-cp-nt9h9-qxx28: failed to get zone from cloud provider: Zone: Error fetching by providerID: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176 Error fetching by NodeName: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176, requeuing
E1106 21:00:04.662200       1 node_controller.go:240] error syncing 'demo1-cp-nt9h9-qxx28': failed to get instance metadata for node demo1-cp-nt9h9-qxx28: failed to get zone from cloud provider: Zone: Error fetching by providerID: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176 Error fetching by NodeName: vSphere region category k8s-region does not match any tags for mo: VirtualMachine:vm-1176, requeuing
[… the same Initializing / zones.go errors / node_controller "error syncing" sequence repeats every couple of minutes …]
```
c
There you go. Fix that.
p
right, so should I disable this in the UI?
c
idk, it depends on how you’ve set up vsphere. If you’re not going to tag your VM hosts then you probably shouldn’t tell vsphere to expect region and zone tags.
p
right, it seems that was the issue, now I can see nodes are joining the cluster...
c
yep! that was why I was suggesting you check the pod logs!
So this was probably something you were doing differently across the two Rancher servers? Nothing to do with the version or anything else?
p
most likely
c
This is all vsphere stuff. It is documented in their project: https://cloud-provider-vsphere.sigs.k8s.io/tutorials/deploying_cpi_with_multi_dc_vc_aka_zones
See the "Creating Zones in your vSphere Environment via Tags" section
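(For anyone hitting the same zone-tag error: the tag categories and tags the CPI expects can be created with govc, roughly as sketched below. The category names must match the region/zone categories in your CPI config; k8s-region appears in the error above, k8s-zone is its usual counterpart, and the inventory paths are placeholders apart from the datacenter name from these logs:)
```
# create the tag categories the CPI looks for
govc tags.category.create k8s-region
govc tags.category.create k8s-zone
# create one tag per region/zone and attach them to the inventory objects
govc tags.create -c k8s-region region-1
govc tags.create -c k8s-zone zone-a
govc tags.attach -c k8s-region region-1 /My_Datacenter
govc tags.attach -c k8s-zone zone-a /My_Datacenter/host/<your-cluster>
```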
p
amazing, you are right, thank you Brandon for your help 🙏 You saved my day, appreciated
they are all in Ready status, I will try the latest version of Rancher and give it a try
Thanks again!
Hi Brandon, quick question please: I deleted this cluster and created a new one without enabling the vSphere tags (Region and Zone), and again I have the same issue:
no logs...
c
do the same thing man
look at the pending pods to see why they’re pending, look at the vsphere cpi pod logs to see what you did wrong if the pending pods are pending because the node is tainted uninitialized
p
even if it is running? Because this time the CPI is running; I should still check the logs, right?
c
the cpi was running last time, wasn’t it? It just had errors in its log.
why would you not check the logs? What's the downside?
p
you are absolutely right, but since the CPI is running and the others are in Pending state, there are no logs for the pending ones. Alright, I will check the CPI logs
alright, I found the issue. Actually it is a Rancher bug: when you create a cluster using the Rancher UI and choose the cloud provider "vsphere", on the left side there are the "Add-on: vSphere CPI" and "Add-on: vSphere CSI" sections, where you need to provide the vSphere data, including credentials. When you add the credentials, even if you enable the "Generate Credential's Secret" option, Rancher will not generate the secret for either CPI or CSI, and this causes problems for both, so the cluster will not start properly. I tried it twice, and it is true that Rancher does not generate the secret for CPI and CSI. So you should go back to the UI, "Edit Config" for the cluster, go to the "Add-on: vSphere CPI" and "Add-on: vSphere CSI" sections, re-enter the credentials, and make sure the "Generate Credential's Secret" checkbox is on. After that the remaining pending pods will be initialized and running, and the pending nodes will join the cluster.
Probably I should create an issue on GitHub for this.
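(A quick sanity check that the credential secrets actually got created; vsphere-config-secret is the CSI secret name visible in the describe output earlier in the thread, and the exact name of the CPI credential secret depends on the chart, hence the grep:)
```
k -n kube-system get secrets | grep -i vsphere
k -n kube-system get secret vsphere-config-secret
```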