# vsphere
w
• Rancher v2.7.5 on RKE2, provisioning v1.25.12+rke2r1
• Target: vSphere 6.7.0.54000 (6.7 U3s). Tried a single-node deployment and a 3-node deployment with all roles.
• Cluster creation status: "Configuring bootstrap node(s) argh-pool1-7cbd9f6556-gfnqh: waiting for cluster agent to connect"
• The downstream cluster is up, but many of its pods are stuck in Pending:
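(For context, a listing like the one below comes from something along these lines; the exact command wasn't captured in the thread:)

# all pods, all namespaces, on the downstream cluster
kubectl get pods -A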
NAMESPACE         NAME                                                    READY   STATUS      RESTARTS   AGE
calico-system     calico-kube-controllers-b69ccbb87-mp6t9                 0/1     Pending     0          26m
calico-system     calico-node-lsznm                                       0/1     Running     0          26m
calico-system     calico-typha-68b6d7d7c9-2vkhk                           0/1     Pending     0          26m
cattle-system     cattle-cluster-agent-6c8c99c76f-989sk                   0/1     Pending     0          27m
kube-system       etcd-argh-pool1-eaaadf3e-4h4vs                          1/1     Running     0          27m
kube-system       helm-install-rancher-vsphere-cpi-kwhdq                  0/1     Completed   0          27m
kube-system       helm-install-rancher-vsphere-csi-h4xbq                  0/1     Completed   0          27m
kube-system       helm-install-rke2-calico-7lzfw                          0/1     Completed   2          27m
kube-system       helm-install-rke2-calico-crd-j47sl                      0/1     Completed   0          27m
kube-system       helm-install-rke2-coredns-ggf9w                         0/1     Completed   0          27m
kube-system       helm-install-rke2-ingress-nginx-722b9                   0/1     Pending     0          27m
kube-system       helm-install-rke2-metrics-server-n5whn                  0/1     Pending     0          27m
kube-system       helm-install-rke2-snapshot-controller-crd-f5bs4         0/1     Pending     0          27m
kube-system       helm-install-rke2-snapshot-controller-x25pc             0/1     Pending     0          27m
kube-system       helm-install-rke2-snapshot-validation-webhook-sk5fm     0/1     Pending     0          27m
kube-system       kube-apiserver-argh-pool1-eaaadf3e-4h4vs                1/1     Running     0          26m
kube-system       kube-controller-manager-argh-pool1-eaaadf3e-4h4vs       1/1     Running     0          27m
kube-system       kube-proxy-argh-pool1-eaaadf3e-4h4vs                    1/1     Running     0          27m
kube-system       kube-scheduler-argh-pool1-eaaadf3e-4h4vs                1/1     Running     0          27m
kube-system       rancher-vsphere-cpi-cloud-controller-manager-pdk8f      1/1     Running     0          26m
kube-system       rke2-coredns-rke2-coredns-7c98b7488c-r2fq4              0/1     Pending     0          26m
kube-system       rke2-coredns-rke2-coredns-autoscaler-65b5bfc754-d5flx   0/1     Pending     0          26m
kube-system       vsphere-csi-controller-c6b684f79-cxzrq                  0/5     Pending     0          26m
kube-system       vsphere-csi-controller-c6b684f79-fk56s                  0/5     Pending     0          26m
kube-system       vsphere-csi-controller-c6b684f79-rz5h2                  0/5     Pending     0          26m
tigera-operator   tigera-operator-6869bc46c4-hkqsq                        1/1     Running     0          26m
systemctl status rancher-system-agent
● rancher-system-agent.service - Rancher System Agent
     Loaded: loaded (/etc/systemd/system/rancher-system-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-08-03 22:57:47 UTC; 16min ago
       Docs: https://www.rancher.com
   Main PID: 21155 (rancher-system-)
      Tasks: 10 (limit: 4663)
     Memory: 103.2M
     CGroup: /system.slice/rancher-system-agent.service
             └─21155 /usr/local/bin/rancher-system-agent sentinel
journalctl -eu rancher-system-agent
Aug 03 23:12:09 argh-pool1-eaaadf3e-4h4vs rancher-system-agent[21155]: time="2023-08-03T23:12:09Z" level=info msg="[K8s] updated plan secret fleet-default/argh-bootstrap-template-4sbq9-machine-plan with feedback"
Aug 03 23:12:09 argh-pool1-eaaadf3e-4h4vs rancher-system-agent[21155]: time="2023-08-03T23:12:09Z" level=info msg="[K8s] updated plan secret fleet-default/argh-bootstrap-template-4sbq9-machine-plan with feedback"
Aug 03 23:17:51 argh-pool1-eaaadf3e-4h4vs rancher-system-agent[21155]: time="2023-08-03T23:17:51Z" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20230803-231751/e86f09f842136ac61b0f1af68fdeb4862090e49902da9fd63deab74bd11366f6_0"
Aug 03 23:17:51 argh-pool1-eaaadf3e-4h4vs rancher-system-agent[21155]: time="2023-08-03T23:17:51Z" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Aug 03 23:17:51 argh-pool1-eaaadf3e-4h4vs rancher-system-agent[21155]: time="2023-08-03T23:17:51Z" level=info msg="[e86f09f842136ac61b0f1af68fdeb4862090e49902da9fd63deab74bd11366f6_0:stdout]: Name Location Size Created"
Aug 03 23:17:51 argh-pool1-eaaadf3e-4h4vs rancher-system-agent[21155]: time="2023-08-03T23:17:51Z" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Aug 03 23:17:51 argh-pool1-eaaadf3e-4h4vs rancher-system-agent[21155]: time="2023-08-03T23:17:51Z" level=info msg="[K8s] updated plan secret fleet-default/argh-bootstrap-template-4sbq9-machine-plan with feedback"
kubectl describe nodes | egrep "Taints:|Name:"
Name:               argh-pool1-eaaadf3e-4h4vs
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
a
Grab the logs from
rancher-vsphere-cpi-cloud-controller-manager-pdk8f
It's likely the CPI pod hasn't finished the initialization work that eventually removes the above taint from your nodes
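Something like this (pod name taken from your listing above; adjust if it has been recreated):

# logs from the vSphere cloud provider (CPI) controller manager
kubectl -n kube-system logs rancher-vsphere-cpi-cloud-controller-manager-pdk8f

# once the CPI has initialized the node, the uninitialized taint should be gone
kubectl describe nodes | egrep "Taints:|Name:"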
w
Thanks, David! I had a bad password set for some damn reason. The only non-running pod now is the cattle-cluster-agent
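(For what it's worth, the quick check I used to confirm nothing else is unhappy:)

# anything whose STATUS isn't Running or Completed
kubectl get pods -A | grep -vE 'Running|Completed'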
kubectl describe -n cattle-system pod cattle-cluster-agent-6c8c99c76f-989sk
Name:             cattle-cluster-agent-6c8c99c76f-989sk
Namespace:        cattle-system
Priority:         0
Service Account:  cattle
Node:             argh-pool1-eaaadf3e-4h4vs/192.168.0.79
Start Time:       Fri, 04 Aug 2023 10:30:24 +0000
Labels:           app=cattle-cluster-agent
                  pod-template-hash=6c8c99c76f
Annotations:      cni.projectcalico.org/containerID: 99697cbfa037ff47d6f81ad5d12f2111328292732217b313a1e7c6388dbfa85b
                  cni.projectcalico.org/podIP: 10.42.44.135/32
                  cni.projectcalico.org/podIPs: 10.42.44.135/32
Status:           Running
IP:               10.42.44.135
IPs:
  IP:           10.42.44.135
Controlled By:  ReplicaSet/cattle-cluster-agent-6c8c99c76f
Containers:
  cluster-register:
    Container ID:   containerd://124398c62f465b89871f01cbe21834371a8f06e865cd196db1db1b583c887772
    Image:          rancher/rancher-agent:v2.7.5
    Image ID:       docker.io/rancher/rancher-agent@sha256:dd0c335170297cc5797f566ce057383c4406825d1e9fa364267a705b38c199ea
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 04 Aug 2023 10:35:49 +0000
      Finished:     Fri, 04 Aug 2023 10:35:49 +0000
    Ready:          False
    Restart Count:  5
    Environment:
      CATTLE_FEATURES:           embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false
      CATTLE_IS_RKE:             false
      CATTLE_SERVER:             https://rancherdev.dev-rke2.systemsmanaged.co.uk
      CATTLE_CA_CHECKSUM:        d2cf6d229a85166f37b856f29493f5a4ae6b701aa97837022cb90f380cdaf743
      CATTLE_CLUSTER:            true
      CATTLE_K8S_MANAGED:        true
      CATTLE_CLUSTER_REGISTRY:
      CATTLE_SERVER_VERSION:     v2.7.5
      CATTLE_INSTALL_UUID:       6added34-5210-40c1-a567-5eb6f7265716
      CATTLE_INGRESS_IP_DOMAIN:  sslip.io
    Mounts:
      /cattle-credentials from cattle-credentials (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ctdtr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cattle-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cattle-credentials-21bca00
    Optional:    false
  kube-api-access-ctdtr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/controlplane=true:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  28m (x136 over 11h)     default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Normal   Scheduled         7m24s                   default-scheduler  Successfully assigned cattle-system/cattle-cluster-agent-6c8c99c76f-989sk to argh-pool1-eaaadf3e-4h4vs
  Normal   Pulling           7m17s                   kubelet            Pulling image "rancher/rancher-agent:v2.7.5"
  Normal   Pulled            4m59s                   kubelet            Successfully pulled image "rancher/rancher-agent:v2.7.5" in 2m17.379704813s (2m17.379731619s including waiting)
  Normal   Created           3m23s (x5 over 4m59s)   kubelet            Created container cluster-register
  Normal   Pulled            3m23s (x4 over 4m55s)   kubelet            Container image "rancher/rancher-agent:v2.7.5" already present on machine
  Normal   Started           3m22s (x5 over 4m59s)   kubelet            Started container cluster-register
  Warning  BackOff           2m13s (x12 over 4m53s)  kubelet            Back-off restarting failed container cluster-register in pod cattle-cluster-agent-6c8c99c76f-989sk_cattle-system(068a3ae5-80d0-4415-ba8e-d151fc932f26)
a
Can you grab the cluster agent pod logs?
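e.g. something like this (pod name from your describe output above; --previous shows the last crashed attempt):

# current logs of the cluster-register container
kubectl -n cattle-system logs cattle-cluster-agent-6c8c99c76f-989sk

# logs from the previous, crashed container, often more useful in CrashLoopBackOff
kubectl -n cattle-system logs --previous cattle-cluster-agent-6c8c99c76f-989sk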
w
Interesting. kubectl logs -n cattle-system cattle-cluster-agent-6c8c99c76f-989sk
INFO: Environment: CATTLE_ADDRESS=10.42.44.135 CATTLE_CA_CHECKSUM=d2cf6d229a85166f37b856f29493f5a4ae6b701aa97837022cb90f380cdaf743 CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://10.43.239.76:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.43.239.76:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.43.239.76 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.43.239.76:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.43.239.76 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.43.239.76 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false CATTLE_INGRESS_IP_DOMAIN=sslip.io CATTLE_INSTALL_UUID=6added34-5210-40c1-a567-5eb6f7265716 CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-6c8c99c76f-989sk CATTLE_RANCHER_WEBHOOK_MIN_VERSION= CATTLE_RANCHER_WEBHOOK_VERSION=2.0.5+up0.3.5 CATTLE_SERVER=https://rancherdev.dev-rke2.mydomainname.co.uk CATTLE_SERVER_VERSION=v2.7.5
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local nameserver 10.43.0.10 options ndots:5
ERROR: https://rancherdev.dev-rke2.mydomainname.co.uk/ping is not accessible (Could not resolve host: rancherdev.dev-rke2.mydomainname.co.uk)
root@argh-pool1-eaaadf3e-4h4vs:~# nslookup rancherdev.dev-rke2.mydomainname.co.uk
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
rancherdev.dev-rke2.mydomainname.co.uk  canonical name = dev-rke2-1.dev-rke2.mydomainname.co.uk.
Name:   dev-rke2-1.dev-rke2.mydomainname.co.uk
Address: 192.168.0.149
All the other pods are running; just the cattle-cluster-agent-xxxx pod in cattle-system is in CrashLoopBackOff at the moment.
a
Can you resolve your rancher hostname from within the cluster?
i.e. from a pod
I see you can from what looks like the node
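Something along these lines would confirm it (throwaway pod; the image and pod name are arbitrary, and the CoreDNS ConfigMap name is a guess based on the RKE2 chart):

# resolve the Rancher hostname via cluster DNS (CoreDNS) from inside a pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup rancherdev.dev-rke2.mydomainname.co.uk

# if that fails while the node resolves fine, check what CoreDNS forwards to
kubectl -n kube-system get configmap rke2-coredns-rke2-coredns -o yaml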
w
Well, things seem to have moved forward. Appreciate the help with spotting the issue at my end.
a
Love the cluster name
w
Attempt #19 will do that to a man