# rke2
a
Adding some logs. Rancher server logs:
2024/06/04 17:15:32 [ERROR] error syncing '_all_': handler user-controllers-controller: userControllersController: failed to set peers for key _all_: failed to start user controllers for cluster c-m-jsx7jx2x: ClusterUnavailable 503: cluster not found, requeuing
2024/06/04 17:17:32 [ERROR] error syncing '_all_': handler user-controllers-controller: userControllersController: failed to set peers for key _all_: failed to start user controllers for cluster c-m-jsx7jx2x: ClusterUnavailable 503: cluster not found, requeuing
W0604 17:18:22.952980      39 transport.go:301] Unable to cancel request for *client.addQuery
W0604 17:18:22.953391      39 transport.go:301] Unable to cancel request for *client.addQuery
2024/06/04 17:35:33 [ERROR] error syncing 'c-m-jsx7jx2x': handler cluster-deploy: cluster context c-m-jsx7jx2x is unavailable, requeuing
2024/06/04 17:35:33 [INFO] [planner] rkecluster fleet-default/cluster-a: configuring bootstrap node(s) custom-3290b4cb1b95: waiting for cluster agent to connect
2024/06/04 17:35:59 [ERROR] error syncing '_all_': handler user-controllers-controller: userControllersController: failed to set peers for key _all_: failed to start user controllers for cluster c-m-jsx7jx2x: ClusterUnavailable 503: cluster not found, requeuing
W0604 17:36:31.090158      39 warnings.go:80] cluster.x-k8s.io/v1alpha3 MachineDeployment is deprecated; use cluster.x-k8s.io/v1beta1 MachineDeployment
W0604 17:37:22.310487      39 warnings.go:80] cluster.x-k8s.io/v1alpha3 MachineSet is deprecated; use cluster.x-k8s.io/v1beta1 MachineSet
2024/06/04 17:37:27 [INFO] [planner] rkecluster fleet-default/cluster-a: configuring bootstrap node(s) custom-3290b4cb1b95: waiting for cluster agent to connect
2024/06/04 17:37:33 [ERROR] error syncing 'c-m-jsx7jx2x': handler cluster-deploy: cluster context c-m-jsx7jx2x is unavailable, requeuing
2024/06/04 17:37:33 [INFO] [planner] rkecluster fleet-default/cluster-a: configuring bootstrap node(s) custom-3290b4cb1b95: waiting for cluster agent to connect
2024/06/04 17:37:59 [ERROR] error syncing '_all_': handler user-controllers-controller: userControllersController: failed to set peers for key _all_: failed to start user controllers for cluster c-m-jsx7jx2x: ClusterUnavailable 503: cluster not found, requeuing
W0604 17:38:16.359467      39 warnings.go:80] cluster.x-k8s.io/v1alpha3 MachineHealthCheck is deprecated; use cluster.x-k8s.io/v1beta1 MachineHealthCheck
2024/06/04 17:38:38 [INFO] [planner] rkecluster fleet-default/cluster-a: configuring bootstrap node(s) custom-3290b4cb1b95: waiting for cluster agent to connect
Fleet controller logs:
time="2024-06-03T12:15:00Z" level=info msg="All controllers have been started"
time="2024-06-03T12:15:00Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=Role controller"
time="2024-06-03T12:15:00Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=ClusterRegistration controller"
time="2024-06-03T12:15:00Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=ClusterRegistrationToken controller"
time="2024-06-03T12:15:00Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
time="2024-06-03T12:15:00Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=Cluster controller"
time="2024-06-03T12:15:00Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller"
time="2024-06-03T12:15:00Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=GitRepoRestriction controller"
time="2024-06-03T12:15:00Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=Bundle controller"
time="2024-06-03T12:15:00Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
time="2024-06-03T12:15:00Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
time="2024-06-03T12:15:00Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
time="2024-06-03T14:11:01Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-cluster-a-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"
CAPI controller manager logs:
I0604 17:41:16.246263       1 machine_controller_phases.go:286] "Waiting for infrastructure provider to create machine infrastructure and report status.ready" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/custom-b83dbd33166c" namespace="fleet-default" name="custom-b83dbd33166c" reconcileID=9a3ba2a5-8acb-44f7-b3f1-581527addefa Cluster="fleet-default/cluster-a" Cluster="fleet-default/cluster-a" CustomMachine="fleet-default/custom-b83dbd33166c"
I0604 17:41:16.246307       1 machine_controller_noderef.go:54] "Waiting for infrastructure provider to report spec.providerID" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/custom-b83dbd33166c" namespace="fleet-default" name="custom-b83dbd33166c" reconcileID=9a3ba2a5-8acb-44f7-b3f1-581527addefa Cluster="fleet-default/cluster-a" Cluster="fleet-default/cluster-a" CustomMachine="fleet-default/custom-b83dbd33166c"
I0604 17:41:20.839385       1 machine_controller_phases.go:286] "Waiting for infrastructure provider to create machine infrastructure and report status.ready" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/custom-b877043bafc2" namespace="fleet-default" name="custom-b877043bafc2" reconcileID=e4bc80ca-afe1-41e7-b761-6808e56ba149 Cluster="fleet-default/cluster-a" Cluster="fleet-default/cluster-a" CustomMachine="fleet-default/custom-b877043bafc2"
I0604 17:41:20.839420       1 machine_controller_noderef.go:54] "Waiting for infrastructure provider to report spec.providerID" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/custom-b877043bafc2" namespace="fleet-default" name="custom-b877043bafc2" reconcileID=e4bc80ca-afe1-41e7-b761-6808e56ba149 Cluster="fleet-default/cluster-a" Cluster="fleet-default/cluster-a" CustomMachine="fleet-default/custom-b877043bafc2"
c
have you looked at the cluster-agent deployment in the downstream cluster to see why it’s not connecting back to Rancher?
a
log-agent.txt
Hi @creamy-pencil-82913, I found these logs.
c
That's not the right thing. Those are logs from the rancher-system-agent service on a node. You need to use kubectl on that node to look at the cluster-agent pod logs. The cluster agent runs in the cluster, not as a service on the node itself.
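For reference, a minimal sketch of pulling those logs directly from a server node, assuming the default RKE2 kubeconfig and binary paths used elsewhere in this thread (the pod name varies, so it is selected by its app label):
# list the cluster-agent pod, then tail its logs from the downstream cluster
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n cattle-system get pods -l app=cattle-cluster-agent
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n cattle-system logs deployment/cattle-cluster-agent --tail=100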
a
Pods:
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods --all-namespaces
NAMESPACE         NAME                                                                      READY   STATUS      RESTARTS   AGE
calico-system     calico-kube-controllers-6c85f4c55d-f4nrf                                  0/1     Pending     0          2d1h
calico-system     calico-node-kt5bw                                                         0/1     Running     0          2d1h
calico-system     calico-typha-65546dc56-vzpv8                                              0/1     Pending     0          2d1h
cattle-system     cattle-cluster-agent-846cd7f57d-mvx7v                                     0/1     Pending     0          2d1h
kube-system       etcd-ip-10-23-82-167.ap-northeast-1.compute.internal                      1/1     Running     0          2d1h
kube-system       helm-install-rke2-calico-28kg4                                            0/1     Completed   1          2d1h
kube-system       helm-install-rke2-calico-crd-9bm6v                                        0/1     Completed   0          2d1h
kube-system       helm-install-rke2-coredns-gvz5n                                           0/1     Completed   0          2d1h
kube-system       helm-install-rke2-ingress-nginx-jt9cw                                     0/1     Pending     0          2d1h
kube-system       helm-install-rke2-metrics-server-qg5fl                                    0/1     Pending     0          2d1h
kube-system       helm-install-rke2-snapshot-controller-crd-vtkkp                           0/1     Pending     0          2d1h
kube-system       helm-install-rke2-snapshot-controller-p7p5c                               0/1     Pending     0          2d1h
kube-system       helm-install-rke2-snapshot-validation-webhook-sqmw4                       0/1     Pending     0          2d1h
kube-system       kube-apiserver-ip-10-23-82-167.ap-northeast-1.compute.internal            1/1     Running     0          2d1h
kube-system       kube-controller-manager-ip-10-23-82-167.ap-northeast-1.compute.internal   1/1     Running     0          2d1h
kube-system       kube-proxy-ip-10-23-82-167.ap-northeast-1.compute.internal                1/1     Running     0          2d1h
kube-system       kube-scheduler-ip-10-23-82-167.ap-northeast-1.compute.internal            1/1     Running     0          2d1h
kube-system       rke2-coredns-rke2-coredns-84b9cb946c-jhmr7                                0/1     Pending     0          2d1h
kube-system       rke2-coredns-rke2-coredns-autoscaler-b49765765-r4xrx                      0/1     Pending     0          2d1h
tigera-operator   tigera-operator-795545875-tvsdh                                           1/1     Running     0          2d1h
The pod cattle-cluster-agent-846cd7f57d-mvx7v doesn't return logs. The same goes for the others stuck in Pending.
Events:
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get events --all-namespaces
NAMESPACE       LAST SEEN   TYPE      REASON                OBJECT                                                     MESSAGE
calico-system   31m         Warning   FailedScheduling      pod/calico-kube-controllers-6c85f4c55d-f4nrf               0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
calico-system   2m18s       Warning   Unhealthy             pod/calico-node-kt5bw                                      Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
calico-system   31m         Warning   FailedScheduling      pod/calico-typha-65546dc56-vzpv8                           0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
cattle-system   31m         Warning   FailedScheduling      pod/cattle-cluster-agent-846cd7f57d-mvx7v                  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     3m3s        Warning   ParseManifestFailed   addon/addons                                               Parse manifest at "/var/lib/rancher/rke2/server/manifests/rancher/addons.yaml" failed: yaml: line 2: mapping values are not allowed in this context
kube-system     3m3s        Normal    ApplyingManifest      addon/cluster-agent                                        Applying manifest at "/var/lib/rancher/rke2/server/manifests/rancher/cluster-agent.yaml"
kube-system     31m         Warning   FailedScheduling      pod/helm-install-rke2-ingress-nginx-jt9cw                  0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     31m         Warning   FailedScheduling      pod/helm-install-rke2-metrics-server-qg5fl                 0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     31m         Warning   FailedScheduling      pod/helm-install-rke2-snapshot-controller-crd-vtkkp        0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     31m         Warning   FailedScheduling      pod/helm-install-rke2-snapshot-controller-p7p5c            0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     31m         Warning   FailedScheduling      pod/helm-install-rke2-snapshot-validation-webhook-sqmw4    0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     3m3s        Normal    ApplyingManifest      addon/managed-chart-config                                 Applying manifest at "/var/lib/rancher/rke2/server/manifests/rancher/managed-chart-config.yaml"
kube-system     3m3s        Normal    ApplyingManifest      addon/rke2-calico-crd                                      Applying manifest at "/var/lib/rancher/rke2/server/manifests/rke2-calico-crd.yaml"
kube-system     3m3s        Normal    ApplyingManifest      addon/rke2-calico                                          Applying manifest at "/var/lib/rancher/rke2/server/manifests/rke2-calico.yaml"
kube-system     31m         Warning   FailedScheduling      pod/rke2-coredns-rke2-coredns-84b9cb946c-jhmr7             0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     32m         Warning   FailedScheduling      pod/rke2-coredns-rke2-coredns-autoscaler-b49765765-r4xrx   0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     3m3s        Normal    ApplyingManifest      addon/rke2-coredns                                         Applying manifest at "/var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml"
kube-system     3m3s        Normal    ApplyingManifest      addon/rke2-etcd-snapshot-extra-metadata                    Applying manifest at "/var/lib/rancher/rke2/server/manifests/rancher/rke2-etcd-snapshot-extra-metadata.yaml"
kube-system     3m3s        Normal    ApplyingManifest      addon/rke2-ingress-nginx                                   Applying manifest at "/var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx.yaml"
kube-system     3m3s        Normal    ApplyingManifest      addon/rke2-metrics-server                                  Applying manifest at "/var/lib/rancher/rke2/server/manifests/rke2-metrics-server.yaml"
kube-system     3m2s        Normal    ApplyingManifest      addon/rke2-snapshot-controller-crd                         Applying manifest at "/var/lib/rancher/rke2/server/manifests/rke2-snapshot-controller-crd.yaml"
kube-system     3m2s        Normal    ApplyingManifest      addon/rke2-snapshot-controller                             Applying manifest at "/var/lib/rancher/rke2/server/manifests/rke2-snapshot-controller.yaml"
kube-system     3m2s        Normal    ApplyingManifest      addon/rke2-snapshot-validation-webhook                     Applying manifest at "/var/lib/rancher/rke2/server/manifests/rke2-snapshot-validation-webhook.yaml"
Describe the agent pod:
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml describe pods  cattle-cluster-agent-846cd7f57d-mvx7v -n cattle-system
Name:             cattle-cluster-agent-846cd7f57d-mvx7v
Namespace:        cattle-system
Priority:         0
Service Account:  cattle
Node:             <none>
Labels:           app=cattle-cluster-agent
                  pod-template-hash=846cd7f57d
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/cattle-cluster-agent-846cd7f57d
Containers:
  cluster-register:
    Image:      rancher/rancher-agent:v2.8.4
    Port:       <none>
    Host Port:  <none>
    Environment:
      CATTLE_FEATURES:           embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false
      CATTLE_IS_RKE:             false
      CATTLE_SERVER:             https://rancher.xxxxxx.xxxxx.net
      CATTLE_CA_CHECKSUM:
      CATTLE_CLUSTER:            true
      CATTLE_K8S_MANAGED:        true
      CATTLE_CLUSTER_REGISTRY:
      CATTLE_SERVER_VERSION:     v2.8.4
      CATTLE_INSTALL_UUID:       4b53d6da-bcc0-4370-b372-9aca74caf58a
      CATTLE_INGRESS_IP_DOMAIN:  sslip.io
    Mounts:
      /cattle-credentials from cattle-credentials (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wnr6f (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  cattle-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cattle-credentials-0084dc4
    Optional:    false
  kube-api-access-wnr6f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/etcd:NoExecute
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  29m (x593 over 2d1h)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
c
It looks like the aws cloud provider hasn’t been deployed properly. The nodes are still tainted with
node.cloudprovider.kubernetes.io/uninitialized: true
The logs also indicate you have a typo in the manifest you are attempting to use to deploy the aws cloud provider:
kube-system     3m3s        Warning   ParseManifestFailed   addon/addons                                               Parse manifest at "/var/lib/rancher/rke2/server/manifests/rancher/addons.yaml" failed: yaml: line 2: mapping values are not allowed in this context
additional_manifest = <<EOF
  apiVersion: helm.cattle.io/v1
        kind: HelmChart
        metadata:
          name: aws-cloud-controller-manager
          namespace: kube-system
        spec:
          chart: aws-cloud-controller-manager
          repo: https://kubernetes.github.io/cloud-provider-aws
          targetNamespace: kube-system
          bootstrap: true
          valuesContent: |-
            hostNetworking: true
            nodeSelector:
              node-role.kubernetes.io/control-plane: "true"
            args:
              - --configure-cloud-routes=false
              - --v=5
              - --cloud-provider=aws    
  EOF
I suspect the indentation here is all wrong. YAML is very sensitive to indentation in order to get the correct structure.
apiVersion/kind/metadata/spec should all be indented at the same level.
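For comparison, a sketch of the same manifest with the keys aligned the way YAML expects, reusing the values from the snippet above (heredoc markers omitted):
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  chart: aws-cloud-controller-manager
  repo: https://kubernetes.github.io/cloud-provider-aws
  targetNamespace: kube-system
  bootstrap: true
  valuesContent: |-
    hostNetworking: true
    nodeSelector:
      node-role.kubernetes.io/control-plane: "true"
    args:
      - --configure-cloud-routes=false
      - --v=5
      - --cloud-provider=aws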
a
Let me check ...
The indentation error is solved.
Now ...
root /etc/rancher/rke2 # /var/lib/rancher/rke2/bin/kubectl --kubeconfig rke2.yaml events
LAST SEEN           TYPE      REASON                           OBJECT                                                                                                MESSAGE
45m                 Normal    Starting                         Node/ip-10-23-69-245.ap-northeast-1.compute.internal                                                  Starting kubelet.
45m                 Warning   InvalidDiskCapacity              Node/ip-10-23-69-245.ap-northeast-1.compute.internal                                                  invalid capacity 0 on image filesystem
45m                 Normal    NodeAllocatableEnforced          Node/ip-10-23-69-245.ap-northeast-1.compute.internal                                                  Updated Node Allocatable limit across pods
45m (x7 over 45m)   Normal    NodeHasNoDiskPressure            Node/ip-10-23-69-245.ap-northeast-1.compute.internal                                                  Node ip-10-23-69-245.ap-northeast-1.compute.internal status is now: NodeHasNoDiskPressure
45m (x7 over 45m)   Normal    NodeHasSufficientPID             Node/ip-10-23-69-245.ap-northeast-1.compute.internal                                                  Node ip-10-23-69-245.ap-northeast-1.compute.internal status is now: NodeHasSufficientPID
45m (x8 over 45m)   Normal    NodeHasSufficientMemory          Node/ip-10-23-69-245.ap-northeast-1.compute.internal                                                  Node ip-10-23-69-245.ap-northeast-1.compute.internal status is now: NodeHasSufficientMemory
44m                 Normal    Starting                         Node/ip-10-23-69-245.ap-northeast-1.compute.internal
44m                 Normal    NodePasswordValidationComplete   Node/ip-10-23-69-245.ap-northeast-1.compute.internal                                                  Deferred node password secret validation complete
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717358401-be9701    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717358401 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717477202-af30c1   Snapshot etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717477202 saved on ip-10-23-82-167.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717200005-284566    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717200005 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717459203-13a315   Snapshot etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717459203 saved on ip-10-23-82-167.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717408805-38883d    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717408805 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717390804-9bc5d3    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717390804 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717495201-0d51f6   Snapshot etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717495201 saved on ip-10-23-82-167.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717185602-e177c4    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717185602 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717599602-88f715   Snapshot etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717599602 saved on ip-10-23-82-167.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717304400-5d6b09    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717304400 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717372805-604b28    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717372805 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717581600-479ce4   Snapshot etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717581600 saved on ip-10-23-82-167.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717340403-374415    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717340403 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717563600-ecff11   Snapshot etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717563600 saved on ip-10-23-82-167.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717545605-2d6ba2   Snapshot etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717545605 saved on ip-10-23-82-167.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717236003-72026d    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717236003 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717272003-10aebc    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717272003 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717531205-38fbe5   Snapshot etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717531205 saved on ip-10-23-82-167.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717513202-899923   Snapshot etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717513202 saved on ip-10-23-82-167.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717426803-fd0218   Snapshot etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717426803 saved on ip-10-23-82-167.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717218003-f97e85    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717218003 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    RegisteredNode                   Node/ip-10-23-69-245.ap-northeast-1.compute.internal                                                  Node ip-10-23-69-245.ap-northeast-1.compute.internal event: Registered Node ip-10-23-69-245.ap-northeast-1.compute.internal in Controller
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717444801-d9cc72   Snapshot etcd-snapshot-ip-10-23-82-167.ap-northeast-1.compute.internal-1717444801 saved on ip-10-23-82-167.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717286402-9e04eb    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717286402 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717322401-7785bc    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717322401 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
44m                 Normal    ETCDSnapshotCreated              ETCDSnapshotFile/s3-etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717254003-33aebd    Snapshot etcd-snapshot-ip-10-23-67-98.ap-northeast-1.compute.internal-1717254003 saved on ip-10-23-67-98.ap-northeast-1.compute.internal
Describe pod:
Name:             cattle-cluster-agent-77497bd5df-llzhd
Namespace:        cattle-system
Priority:         0
Service Account:  cattle
Node:             <none>
Labels:           app=cattle-cluster-agent
                  pod-template-hash=77497bd5df
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/cattle-cluster-agent-77497bd5df
Containers:
  cluster-register:
    Image:      rancher/rancher-agent:v2.8.4
    Port:       <none>
    Host Port:  <none>
    Environment:
      CATTLE_FEATURES:           embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false
      CATTLE_IS_RKE:             false
      CATTLE_SERVER:             https://rancher.xxx.xxxxxx.net
      CATTLE_CA_CHECKSUM:
      CATTLE_CLUSTER:            true
      CATTLE_K8S_MANAGED:        true
      CATTLE_CLUSTER_REGISTRY:
      CATTLE_SERVER_VERSION:     v2.8.4
      CATTLE_INSTALL_UUID:       4b53d6da-bcc0-4370-b372-9aca74caf58a
      CATTLE_INGRESS_IP_DOMAIN:  sslip.io
    Mounts:
      /cattle-credentials from cattle-credentials (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r82g2 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  cattle-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cattle-credentials-728f371
    Optional:    false
  kube-api-access-r82g2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/etcd:NoExecute
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  50m                default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  25m (x5 over 45m)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
c
ok, is the cloud provider actually working? Did it get deployed, is it actually running, are there errors in its pod logs?
It’s still not clearing the uninitialized taint from your nodes so I suspect there are still issues with the configuration
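A quick sketch of how that could be checked, assuming the chart was deployed into kube-system with its default k8s-app=aws-cloud-controller-manager label (both names come from the HelmChart manifest above):
# list any cloud-controller-manager pods, then tail the daemonset logs
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n kube-system get pods -l k8s-app=aws-cloud-controller-manager -o wide
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n kube-system logs daemonset/aws-cloud-controller-manager --tail=50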
a
Maybe it's some configuration in the RKE config?
resource "rancher2_cluster_v2" "cluster" {
  provider = rancher2.admin

  name               = var.cluster_name
  kubernetes_version = var.cluster_kubernetes_version

  rke_config {
    machine_global_config = yamlencode({
      cloud-provider-name = "aws"
      # kube-apiserver-arg = [
      #     "cloud-provider=external"
      #   ]
      # kube-controller-manager-arg = [
      #     # https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/kubernetes-clusters-in-rancher-setup/migrate-to-an-out-of-tree-cloud-provider/migrate-to-out-of-tree-amazon
      #     "cloud-provider=external",
      #     "terminated-pod-gc-threshold=10",
      #     "enable-leader-migration"
      # ]
    })
    # ETCD Role
    machine_selector_config {
      machine_label_selector {
        match_expressions {
          key      = "rke.cattle.io/etcd-role"
          operator = "In"
          values   = ["true"]
        }
      }
      config = yamlencode({
        # https://docs.rke2.io/reference/server_config
        kubelet-arg = [
          "cloud-provider=external"
        ]
      })
    }
    # Controlplane Role
    machine_selector_config {
      machine_label_selector {
        match_expressions {
          key      = "rke.cattle.io/control-plane-role"
          operator = "In"
          values   = ["true"]
        }
      }
      config = yamlencode({
        disable-cloud-controller = "true"
        # https://docs.rke2.io/reference/server_config
        kubelet-arg = [
          "cloud-provider=external"
        ]
        kube-apiserver-arg = [
          "cloud-provider=external"
        ]
        kube-controller-manager-arg = [
          # https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/kubernetes-clusters-in-rancher-setup/migrate-to-an-out-of-tree-cloud-provider/migrate-to-out-of-tree-amazon
          "cloud-provider=external"
        ]
      })
    }
    # Worker Role
    machine_selector_config {
      machine_label_selector {
        match_expressions {
          key      = "rke.cattle.io/worker-role"
          operator = "In"
          values   = ["true"]
        }
      }
      config = yamlencode({
        # https://docs.rke2.io/reference/server_config
        kubelet-arg = [
          "cloud-provider=external"
        ]
      })
    }
    upgrade_strategy {
      control_plane_concurrency = "1"
      worker_concurrency = "1"
      worker_drain_options {
        enabled = true
        delete_empty_dir_data = true
        ignore_daemon_sets = true
        disable_eviction = true
        force = true
      }
      control_plane_drain_options {
        enabled = true
        delete_empty_dir_data = true
        ignore_daemon_sets = true
        disable_eviction = true
        force = true
      }
    }
    etcd {
      snapshot_schedule_cron = var.snapshot_schedule_cron
      snapshot_retention = var.snapshot_retention
      s3_config {
        bucket    = var.bucket_etcd_bkp_name
        endpoint  = "s3.amazonaws.com"
        folder    = "${var.cluster_name}-etcd-backup"
        region    = data.aws_s3_bucket.selected.region
      }
    }
    additional_manifest = <<EOF
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  chart: aws-cloud-controller-manager
  repo: https://kubernetes.github.io/cloud-provider-aws
  targetNamespace: kube-system
  bootstrap: true
  valuesContent: |-
    hostNetworking: true
    nodeSelector:
      node-role.kubernetes.io/control-plane: "true"
    args:
      - --configure-cloud-routes=false
      - --v=5
      - --cloud-provider=aws    
EOF
  }
}
The Helm chart for the cloud provider was installed, but I couldn't find any pods for it 🤔
I'll review the cloud provider installation.
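One way to confirm RKE2 processed the HelmChart at all is to check the HelmChart resource and its install job; this is a sketch assuming the job follows the helm-install-<chart-name> naming seen on the other helm-install pods above:
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n kube-system get helmchart aws-cloud-controller-manager
# the install job name below is assumed from RKE2's helm-install-<chart> convention
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n kube-system logs job/helm-install-aws-cloud-controller-manager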
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get daemonset -A
NAMESPACE       NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                AGE
calico-system   calico-node                    1         1         0       1            0           kubernetes.io/os=linux                       82m
kube-system     aws-cloud-controller-manager   0         0         0       0            0           node-role.kubernetes.io/control-plane=true   82m
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml describe daemonset aws-cloud-controller-manager -n kube-system
Name:           aws-cloud-controller-manager
Selector:       k8s-app=aws-cloud-controller-manager
Node-Selector:  node-role.kubernetes.io/control-plane=true
Labels:         app.kubernetes.io/managed-by=Helm
                helm.sh/chart=aws-cloud-controller-manager-0.0.8
                k8s-app=aws-cloud-controller-manager
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: aws-cloud-controller-manager
                meta.helm.sh/release-namespace: kube-system
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           k8s-app=aws-cloud-controller-manager
  Service Account:  cloud-controller-manager
  Containers:
   aws-cloud-controller-manager:
    Image:      registry.k8s.io/provider-aws/cloud-controller-manager:v1.27.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --configure-cloud-routes=false
      --v=5
      --cloud-provider=aws
    Requests:
      cpu:              200m
    Environment:        <none>
    Mounts:             <none>
  Volumes:              <none>
  Priority Class Name:  system-node-critical
Events:                 <none>
There is no daemonset for the cloud provider
c
there's a daemonset but no pods, you mean?
a
yes
c
looks like it’s set to only run on nodes that are labeled
node-role.kubernetes.io/control-plane=true
which should match the server nodes
can you confirm that you have nodes in the cluster with those labels?
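For example, a sketch that filters nodes on the same label the daemonset's node selector uses (paths as elsewhere in this thread):
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes -l node-role.kubernetes.io/control-plane=true --show-labels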
a
Question... are these labels on the EC2 instances?
c
no, on the nodes. in kubernetes.
a
I'll check
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml describe  node ip-10-23-73-12.ap-northeast-1.compute.internal
Name:               ip-10-23-73-12.ap-northeast-1.compute.internal
Roles:              control-plane,etcd,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    cattle.io/os=linux
                    cluster=cluster-a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-23-73-12.ap-northeast-1.compute.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/etcd=true
                    node-role.kubernetes.io/master=true
                    rke.cattle.io/machine=6f770611-6351-4e0d-a312-b895926ee233
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.23.73.12
                    etcd.rke2.cattle.io/local-snapshots-timestamp: 2024-06-05T20:00:05Z
                    etcd.rke2.cattle.io/node-address: 10.23.73.12
                    etcd.rke2.cattle.io/node-name: ip-10-23-73-12.ap-northeast-1.compute.internal-8c61dcd2
                    etcd.rke2.cattle.io/s3-snapshots-timestamp: 2024-06-05T20:00:05Z
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.23.73.12/21
                    projectcalico.org/IPv4VXLANTunnelAddr: 10.42.163.64
                    rke2.io/encryption-config-hash: start-e9f47f8b849bfbff54442466cd7e9c9cfc59a75ae2bf3c8f290f1dca79b41ea4
                    rke2.io/node-args:
                      ["server","--agent-token","********","--cloud-provider-name","aws","--cni","calico","--disable-cloud-controller","true","--etcd-s3","true"...
                    rke2.io/node-config-hash: BCAD2G5ODG3OPCBUZKLUA2Y54ESL5WCRBHKNIK644J3EQJFAHZKA====
                    rke2.io/node-env: {}
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 05 Jun 2024 19:33:50 +0000
Taints:             node-role.kubernetes.io/etcd:NoExecute
                    node-role.kubernetes.io/control-plane:NoSchedule
                    node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-23-73-12.ap-northeast-1.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 05 Jun 2024 20:03:37 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 05 Jun 2024 19:35:14 +0000   Wed, 05 Jun 2024 19:35:14 +0000   CalicoIsUp                   Calico is running on this node
  EtcdIsVoter          True    Wed, 05 Jun 2024 19:59:08 +0000   Wed, 05 Jun 2024 19:34:08 +0000   MemberNotLearner             Node is a voting member of the etcd cluster
  MemoryPressure       False   Wed, 05 Jun 2024 20:01:04 +0000   Wed, 05 Jun 2024 19:33:50 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 05 Jun 2024 20:01:04 +0000   Wed, 05 Jun 2024 19:33:50 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 05 Jun 2024 20:01:04 +0000   Wed, 05 Jun 2024 19:33:50 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 05 Jun 2024 20:01:04 +0000   Wed, 05 Jun 2024 19:35:03 +0000   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.23.73.12
  Hostname:    ip-10-23-73-12.ap-northeast-1.compute.internal
Capacity:
  cpu:                2
  ephemeral-storage:  31444972Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3964656Ki
  pods:               110
Allocatable:
  cpu:                2
  ephemeral-storage:  30589668738
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3964656Ki
  pods:               110
System Info:
  Machine ID:                 ec2f1632f8badd44449ca92387028156
  System UUID:                EC2B4500-0193-3E54-7118-672DECEE6B40
  Boot ID:                    c0430239-2f6a-455e-be85-53c191b637f9
  Kernel Version:             4.14.336-257.566.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  <containerd://1.7.11-k3s2>
  Kubelet Version:            v1.28.10+rke2r1
  Kube-Proxy Version:         v1.28.10+rke2r1
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
Non-terminated Pods:          (7 in total)
  Namespace                   Name                                                                      CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                                      ------------  ----------  ---------------  -------------  ---
  calico-system               calico-node-mhx8g                                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         28m
  kube-system                 etcd-ip-10-23-73-12.ap-northeast-1.compute.internal                       200m (10%)    0 (0%)      512Mi (13%)      0 (0%)         29m
  kube-system                 kube-apiserver-ip-10-23-73-12.ap-northeast-1.compute.internal             250m (12%)    0 (0%)      1Gi (26%)        0 (0%)         29m
  kube-system                 kube-controller-manager-ip-10-23-73-12.ap-northeast-1.compute.internal    200m (10%)    0 (0%)      256Mi (6%)       0 (0%)         29m
  kube-system                 kube-proxy-ip-10-23-73-12.ap-northeast-1.compute.internal                 250m (12%)    0 (0%)      128Mi (3%)       0 (0%)         29m
  kube-system                 kube-scheduler-ip-10-23-73-12.ap-northeast-1.compute.internal             100m (5%)     0 (0%)      128Mi (3%)       0 (0%)         29m
  tigera-operator             tigera-operator-795545875-mhjsn                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         29m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                1 (50%)    0 (0%)
  memory             2Gi (52%)  0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:
  Type    Reason                          Age                From             Message
  ----    ------                          ----               ----             -------
  Normal  Starting                        29m                kube-proxy
  Normal  NodeAllocatableEnforced         30m                kubelet          Updated Node Allocatable limit across pods
  Normal  NodeHasNoDiskPressure           30m (x7 over 30m)  kubelet          Node ip-10-23-73-12.ap-northeast-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID            30m (x7 over 30m)  kubelet          Node ip-10-23-73-12.ap-northeast-1.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeHasSufficientMemory         30m (x8 over 30m)  kubelet          Node ip-10-23-73-12.ap-northeast-1.compute.internal status is now: NodeHasSufficientMemory
  Normal  NodePasswordValidationComplete  29m                rke2-supervisor  Deferred node password secret validation complete
  Normal  RegisteredNode                  29m                node-controller  Node ip-10-23-73-12.ap-northeast-1.compute.internal event: Registered Node ip-10-23-73-12.ap-northeast-1.compute.internal in Controller
It has the labels.
c
yeah so it should try to put them there… I wonder if perhaps it needs to tolerate the taint for the daemonset controller to try to schedule it there. Can you set tolerations in the chart config?
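A sketch of what that could look like under valuesContent in the HelmChart shown earlier, assuming the chart exposes a top-level tolerations value; operator: Exists matches the RKE2 control-plane and etcd taints whether or not they carry a value:
valuesContent: |-
  hostNetworking: true
  nodeSelector:
    node-role.kubernetes.io/control-plane: "true"
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/etcd
      operator: Exists
      effect: NoExecute
    - key: node.cloudprovider.kubernetes.io/uninitialized
      operator: Exists
      effect: NoSchedule
  args:
    - --configure-cloud-routes=false
    - --v=5
    - --cloud-provider=aws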
a
Yes, we can do that.
I'm reviewing all the docs about the cloud provider again.
Maybe I found something I missed.
The cloud-controller-manager expects the label
node-role.kubernetes.io/control-plane=true
but the RKE config in the docs expects the label
rke.cattle.io/control-plane-role
Is it possible this configuration is causing issues launching the cluster?
c
no. those docs are for RKE, not RKE2. and also that label is not in the selector on the daemonset anyway
Have you tried adding tolerations to the daemonset?
a
I followed the RKE2 steps indicated in the documentation. 🤔
The default values for the cloud controller manager are:
tolerations:
- key: node.cloudprovider.kubernetes.io/uninitialized
  value: "true"
  effect: NoSchedule
- key: node-role.kubernetes.io/control-plane
  value: "true"
  effect: NoSchedule
The same config as suggested in the documentation.
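One way to compare the chart defaults with what actually landed on the daemonset is to read the tolerations straight off the object, for example:
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n kube-system get daemonset aws-cloud-controller-manager -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'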
c
that looks good… I’m not sure why the deployment isn’t scheduling any pods onto your nodes.
can you get the output of
kubectl get pod -A -o wide
?
a
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pod -A -o wide
NAMESPACE         NAME                                                                     READY   STATUS      RESTARTS   AGE   IP            NODE                                             NOMINATED NODE   READINESS GATES
calico-system     calico-kube-controllers-5f5878bc6c-dg2cq                                 0/1     Pending     0          16m   <none>        <none>                                           <none>           <none>
calico-system     calico-node-gsmnx                                                        0/1     Running     0          16m   10.23.68.70   ip-10-23-68-70.ap-northeast-1.compute.internal   <none>           <none>
calico-system     calico-typha-577847c459-vvvgh                                            0/1     Pending     0          16m   <none>        <none>                                           <none>           <none>
cattle-system     cattle-cluster-agent-9dfc7797d-qvs9p                                     0/1     Pending     0          17m   <none>        <none>                                           <none>           <none>
kube-system       etcd-ip-10-23-68-70.ap-northeast-1.compute.internal                      1/1     Running     0          17m   10.23.68.70   ip-10-23-68-70.ap-northeast-1.compute.internal   <none>           <none>
kube-system       helm-install-rke2-calico-crd-4xrmp                                       0/1     Completed   0          17m   10.23.68.70   ip-10-23-68-70.ap-northeast-1.compute.internal   <none>           <none>
kube-system       helm-install-rke2-calico-vx4k9                                           0/1     Completed   2          17m   10.23.68.70   ip-10-23-68-70.ap-northeast-1.compute.internal   <none>           <none>
kube-system       helm-install-rke2-coredns-4knlr                                          0/1     Completed   0          17m   10.23.68.70   ip-10-23-68-70.ap-northeast-1.compute.internal   <none>           <none>
kube-system       helm-install-rke2-ingress-nginx-n86v9                                    0/1     Pending     0          17m   <none>        <none>                                           <none>           <none>
kube-system       helm-install-rke2-metrics-server-2pbcj                                   0/1     Pending     0          17m   <none>        <none>                                           <none>           <none>
kube-system       helm-install-rke2-snapshot-controller-b6zml                              0/1     Pending     0          17m   <none>        <none>                                           <none>           <none>
kube-system       helm-install-rke2-snapshot-controller-crd-vqkbj                          0/1     Pending     0          17m   <none>        <none>                                           <none>           <none>
kube-system       helm-install-rke2-snapshot-validation-webhook-xjj5h                      0/1     Pending     0          17m   <none>        <none>                                           <none>           <none>
kube-system       kube-apiserver-ip-10-23-68-70.ap-northeast-1.compute.internal            1/1     Running     0          17m   10.23.68.70   ip-10-23-68-70.ap-northeast-1.compute.internal   <none>           <none>
kube-system       kube-controller-manager-ip-10-23-68-70.ap-northeast-1.compute.internal   1/1     Running     0          17m   10.23.68.70   ip-10-23-68-70.ap-northeast-1.compute.internal   <none>           <none>
kube-system       kube-proxy-ip-10-23-68-70.ap-northeast-1.compute.internal                1/1     Running     0          17m   10.23.68.70   ip-10-23-68-70.ap-northeast-1.compute.internal   <none>           <none>
kube-system       kube-scheduler-ip-10-23-68-70.ap-northeast-1.compute.internal            1/1     Running     0          17m   10.23.68.70   ip-10-23-68-70.ap-northeast-1.compute.internal   <none>           <none>
kube-system       rke2-coredns-rke2-coredns-84b9cb946c-d66xj                               0/1     Pending     0          17m   <none>        <none>                                           <none>           <none>
kube-system       rke2-coredns-rke2-coredns-autoscaler-b49765765-dj9r9                     0/1     Pending     0          17m   <none>        <none>                                           <none>           <none>
tigera-operator   tigera-operator-795545875-bdczd                                          1/1     Running     0          16m   10.23.68.70   ip-10-23-68-70.ap-northeast-1.compute.internal   <none>           <none>
Describe pod:
Copy code
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml describe pod cattle-cluster-agent-9dfc7797d-qvs9p -n cattle-system
Name:             cattle-cluster-agent-9dfc7797d-qvs9p
Namespace:        cattle-system
Priority:         0
Service Account:  cattle
Node:             <none>
Labels:           app=cattle-cluster-agent
                  pod-template-hash=9dfc7797d
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/cattle-cluster-agent-9dfc7797d
Containers:
  cluster-register:
    Image:      rancher/rancher-agent:v2.8.4
    Port:       <none>
    Host Port:  <none>
    Environment:
      CATTLE_FEATURES:           embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false
      CATTLE_IS_RKE:             false
      CATTLE_SERVER:             https://rancher.xxx.xxxxxxx.net
      CATTLE_CA_CHECKSUM:
      CATTLE_CLUSTER:            true
      CATTLE_K8S_MANAGED:        true
      CATTLE_CLUSTER_REGISTRY:
      CATTLE_SERVER_VERSION:     v2.8.4
      CATTLE_INSTALL_UUID:       4b53d6da-bcc0-4370-b372-9aca74caf58a
      CATTLE_INGRESS_IP_DOMAIN:  sslip.io
    Mounts:
      /cattle-credentials from cattle-credentials (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqnmn (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  cattle-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cattle-credentials-b92333e
    Optional:    false
  kube-api-access-vqnmn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/etcd:NoExecute
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  18m                  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  7m53s (x2 over 12m)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
It sounds like the configured taint is unsupported. 🤔
c
that’s not it
it’s waiting for the cloud controller to run, but for some reason the daemonset isn’t getting scheduled to the node
I don’t see a
helm-install
pod for your CCM, how did you deploy that?
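For reference, one way to deploy the AWS CCM on RKE2 is to place a HelmChart manifest in the server's manifests directory so the embedded helm-controller installs it. A minimal sketch, assuming the upstream cloud-provider-aws chart repository; the values shown are illustrative only:
Copy code
# /var/lib/rancher/rke2/server/manifests/aws-ccm.yaml (sketch)
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  chart: aws-cloud-controller-manager
  repo: https://kubernetes.github.io/cloud-provider-aws
  targetNamespace: kube-system
  valuesContent: |-
    # illustrative values; adjust to your environment
    args:
      - --configure-cloud-routes=false
      - --v=5
      - --cloud-provider=aws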
a
Sorry for the delay.
Now I have enabled both approaches: additional_manifest inside the RKE config and the Helm install.
Copy code
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods  -A
NAMESPACE         NAME                                                                     READY   STATUS      RESTARTS   AGE
calico-system     calico-kube-controllers-bf77f77-s2bb7                                    0/1     Pending     0          14m
calico-system     calico-node-srvl2                                                        0/1     Running     0          14m
calico-system     calico-typha-5d5c97c9d6-qdvtn                                            0/1     Pending     0          14m
cattle-system     cattle-cluster-agent-69c7d48cd-c6dd9                                     0/1     Pending     0          14m
kube-system       etcd-ip-10-23-82-31.ap-northeast-1.compute.internal                      1/1     Running     0          14m
kube-system       helm-install-aws-cloud-controller-manager-xhgsk                          0/1     Completed   0          14m
kube-system       helm-install-rke2-calico-crd-4cxrn                                       0/1     Completed   0          14m
kube-system       helm-install-rke2-calico-rbs6s                                           0/1     Completed   2          14m
kube-system       helm-install-rke2-coredns-fftgf                                          0/1     Completed   0          14m
kube-system       helm-install-rke2-ingress-nginx-gxwqd                                    0/1     Pending     0          14m
kube-system       helm-install-rke2-metrics-server-srtmf                                   0/1     Pending     0          14m
kube-system       helm-install-rke2-snapshot-controller-crd-ncndj                          0/1     Pending     0          14m
kube-system       helm-install-rke2-snapshot-controller-hcxff                              0/1     Pending     0          14m
kube-system       helm-install-rke2-snapshot-validation-webhook-zbbht                      0/1     Pending     0          14m
kube-system       kube-apiserver-ip-10-23-82-31.ap-northeast-1.compute.internal            1/1     Running     0          14m
kube-system       kube-controller-manager-ip-10-23-82-31.ap-northeast-1.compute.internal   1/1     Running     0          14m
kube-system       kube-proxy-ip-10-23-82-31.ap-northeast-1.compute.internal                1/1     Running     0          14m
kube-system       kube-scheduler-ip-10-23-82-31.ap-northeast-1.compute.internal            1/1     Running     0          14m
kube-system       rke2-coredns-rke2-coredns-84b9cb946c-228k6                               0/1     Pending     0          14m
kube-system       rke2-coredns-rke2-coredns-autoscaler-b49765765-pv8vl                     0/1     Pending     0          14m
tigera-operator   tigera-operator-795545875-r65sb                                          1/1     Running     0          14m
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml describe pod cattle-cluster-agent-69c7d48cd-c6dd9 -n cattle-system
Name:             cattle-cluster-agent-69c7d48cd-c6dd9
Namespace:        cattle-system
Priority:         0
Service Account:  cattle
Node:             <none>
Labels:           app=cattle-cluster-agent
                  pod-template-hash=69c7d48cd
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/cattle-cluster-agent-69c7d48cd
Containers:
  cluster-register:
    Image:      rancher/rancher-agent:v2.8.4
    Port:       <none>
    Host Port:  <none>
    Environment:
      CATTLE_FEATURES:           embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false
      CATTLE_IS_RKE:             false
      CATTLE_SERVER:             https://rancher.xxxx.xxxxxx.net
      CATTLE_CA_CHECKSUM:
      CATTLE_CLUSTER:            true
      CATTLE_K8S_MANAGED:        true
      CATTLE_CLUSTER_REGISTRY:
      CATTLE_SERVER_VERSION:     v2.8.4
      CATTLE_INSTALL_UUID:       4b53d6da-bcc0-4370-b372-9aca74caf58a
      CATTLE_INGRESS_IP_DOMAIN:  sslip.io
    Mounts:
      /cattle-credentials from cattle-credentials (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-445cd (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  cattle-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cattle-credentials-57e3481
    Optional:    false
  kube-api-access-445cd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/etcd:NoExecute
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  15m                  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  5m10s (x2 over 10m)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Following the documentation:
Copy code
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pod -A -o wide
NAMESPACE         NAME                                                                     READY   STATUS      RESTARTS   AGE   IP            NODE                                             NOMINATED NODE   READINESS GATES
calico-system     calico-kube-controllers-bf77f77-s2bb7                                    0/1     Pending     0          34m   <none>        <none>                                           <none>           <none>
calico-system     calico-node-srvl2                                                        0/1     Running     0          34m   10.23.82.31   ip-10-23-82-31.ap-northeast-1.compute.internal   <none>           <none>
calico-system     calico-typha-5d5c97c9d6-qdvtn                                            0/1     Pending     0          34m   <none>        <none>                                           <none>           <none>
cattle-system     cattle-cluster-agent-69c7d48cd-c6dd9                                     0/1     Pending     0          35m   <none>        <none>                                           <none>           <none>
kube-system       etcd-ip-10-23-82-31.ap-northeast-1.compute.internal                      1/1     Running     0          34m   10.23.82.31   ip-10-23-82-31.ap-northeast-1.compute.internal   <none>           <none>
kube-system       helm-install-aws-cloud-controller-manager-xhgsk                          0/1     Completed   0          35m   10.23.82.31   ip-10-23-82-31.ap-northeast-1.compute.internal   <none>           <none>
kube-system       helm-install-rke2-calico-crd-4cxrn                                       0/1     Completed   0          35m   10.23.82.31   ip-10-23-82-31.ap-northeast-1.compute.internal   <none>           <none>
kube-system       helm-install-rke2-calico-rbs6s                                           0/1     Completed   2          35m   10.23.82.31   ip-10-23-82-31.ap-northeast-1.compute.internal   <none>           <none>
kube-system       helm-install-rke2-coredns-fftgf                                          0/1     Completed   0          35m   10.23.82.31   ip-10-23-82-31.ap-northeast-1.compute.internal   <none>           <none>
kube-system       helm-install-rke2-ingress-nginx-gxwqd                                    0/1     Pending     0          35m   <none>        <none>                                           <none>           <none>
kube-system       helm-install-rke2-metrics-server-srtmf                                   0/1     Pending     0          35m   <none>        <none>                                           <none>           <none>
kube-system       helm-install-rke2-snapshot-controller-crd-ncndj                          0/1     Pending     0          35m   <none>        <none>                                           <none>           <none>
kube-system       helm-install-rke2-snapshot-controller-hcxff                              0/1     Pending     0          35m   <none>        <none>                                           <none>           <none>
kube-system       helm-install-rke2-snapshot-validation-webhook-zbbht                      0/1     Pending     0          35m   <none>        <none>                                           <none>           <none>
kube-system       kube-apiserver-ip-10-23-82-31.ap-northeast-1.compute.internal            1/1     Running     0          34m   10.23.82.31   ip-10-23-82-31.ap-northeast-1.compute.internal   <none>           <none>
kube-system       kube-controller-manager-ip-10-23-82-31.ap-northeast-1.compute.internal   1/1     Running     0          35m   10.23.82.31   ip-10-23-82-31.ap-northeast-1.compute.internal   <none>           <none>
kube-system       kube-proxy-ip-10-23-82-31.ap-northeast-1.compute.internal                1/1     Running     0          35m   10.23.82.31   ip-10-23-82-31.ap-northeast-1.compute.internal   <none>           <none>
kube-system       kube-scheduler-ip-10-23-82-31.ap-northeast-1.compute.internal            1/1     Running     0          35m   10.23.82.31   ip-10-23-82-31.ap-northeast-1.compute.internal   <none>           <none>
kube-system       rke2-coredns-rke2-coredns-84b9cb946c-228k6                               0/1     Pending     0          34m   <none>        <none>                                           <none>           <none>
kube-system       rke2-coredns-rke2-coredns-autoscaler-b49765765-pv8vl                     0/1     Pending     0          34m   <none>        <none>                                           <none>           <none>
tigera-operator   tigera-operator-795545875-r65sb                                          1/1     Running     0          34m   10.23.82.31   ip-10-23-82-31.ap-northeast-1.compute.internal   <none>           <none>
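Regarding the additional_manifest route mentioned above: in a Rancher-provisioned RKE2 cluster the same manifest can be embedded in the cluster spec instead of being applied from the CLI. A rough sketch, assuming the provisioning.cattle.io/v1 Cluster object and its rkeConfig.additionalManifest field; names match the cluster shown in the logs:
Copy code
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: cluster-a
  namespace: fleet-default
spec:
  rkeConfig:
    additionalManifest: |-
      # HelmChart for the AWS CCM, as in the earlier sketch
      apiVersion: helm.cattle.io/v1
      kind: HelmChart
      metadata:
        name: aws-cloud-controller-manager
        namespace: kube-system
      spec:
        chart: aws-cloud-controller-manager
        repo: https://kubernetes.github.io/cloud-provider-aws
        targetNamespace: kube-system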
c
can you show the aws ccm daemonset?
a
Copy code
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get daemonsets -A
NAMESPACE       NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                AGE
calico-system   calico-node                    1         1         0       1            0           kubernetes.io/os=linux                       31m
kube-system     aws-cloud-controller-manager   0         0         0       0            0           node-role.kubernetes.io/control-plane=true   32m
Copy code
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml describe daemonset aws-cloud-controller-manager -n kube-system
Name:           aws-cloud-controller-manager
Selector:       k8s-app=aws-cloud-controller-manager
Node-Selector:  node-role.kubernetes.io/control-plane=true
Labels:         app.kubernetes.io/managed-by=Helm
                helm.sh/chart=aws-cloud-controller-manager-0.0.8
                k8s-app=aws-cloud-controller-manager
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: aws-cloud-controller-manager
                meta.helm.sh/release-namespace: kube-system
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           k8s-app=aws-cloud-controller-manager
  Service Account:  cloud-controller-manager
  Containers:
   aws-cloud-controller-manager:
    Image:      registry.k8s.io/provider-aws/cloud-controller-manager:v1.27.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --configure-cloud-routes=false
      --v=5
      --cloud-provider=aws
    Requests:
      cpu:              200m
    Environment:        <none>
    Mounts:             <none>
  Volumes:              <none>
  Priority Class Name:  system-node-critical
Events:                 <none>
Copy code
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get events -A
NAMESPACE       LAST SEEN   TYPE      REASON                OBJECT                                                                                                  MESSAGE
calico-system   10m         Warning   FailedScheduling      pod/calico-kube-controllers-bf77f77-s2bb7                                                               0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
calico-system   51s         Warning   Unhealthy             pod/calico-node-srvl2                                                                                   Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
calico-system   10m         Warning   FailedScheduling      pod/calico-typha-5d5c97c9d6-qdvtn                                                                       0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
cattle-system   11m         Warning   FailedScheduling      pod/cattle-cluster-agent-69c7d48cd-c6dd9                                                                0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
default         28m         Normal    ETCDSnapshotCreated   etcdsnapshotfile/local-etcd-snapshot-ip-10-23-82-31.ap-northeast-1.compute.internal-1717772405-6164a2   Snapshot etcd-snapshot-ip-10-23-82-31.ap-northeast-1.compute.internal-1717772405 saved on ip-10-23-82-31.ap-northeast-1.compute.internal
default         28m         Normal    ETCDSnapshotCreated   etcdsnapshotfile/s3-etcd-snapshot-ip-10-23-82-31.ap-northeast-1.compute.internal-1717772405-16aa2e      Snapshot etcd-snapshot-ip-10-23-82-31.ap-northeast-1.compute.internal-1717772405 saved on ip-10-23-82-31.ap-northeast-1.compute.internal
kube-system     11m         Warning   FailedScheduling      pod/helm-install-rke2-ingress-nginx-gxwqd                                                               0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     11m         Warning   FailedScheduling      pod/helm-install-rke2-metrics-server-srtmf                                                              0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     11m         Warning   FailedScheduling      pod/helm-install-rke2-snapshot-controller-crd-ncndj                                                     0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     11m         Warning   FailedScheduling      pod/helm-install-rke2-snapshot-controller-hcxff                                                         0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     11m         Warning   FailedScheduling      pod/helm-install-rke2-snapshot-validation-webhook-zbbht                                                 0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     11m         Warning   FailedScheduling      pod/rke2-coredns-rke2-coredns-84b9cb946c-228k6                                                          0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
kube-system     11m         Warning   FailedScheduling      pod/rke2-coredns-rke2-coredns-autoscaler-b49765765-pv8vl                                                0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
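For context on the DESIRED 0 above: the DaemonSet controller only counts a node when it matches the node selector and the pod can tolerate the node's taints, and an untolerated NoExecute taint in particular can exclude the node entirely. On a single server carrying both the etcd and control-plane roles, the taints would likely look something like the sketch below (inferred from the scheduler events above and the tolerations on the cattle-cluster-agent pod; keys and values are assumptions, not taken from this node):
Copy code
# Sketch of the taints expected on a combined etcd + control-plane node
# when the kubelet runs with an external cloud provider (illustrative only)
spec:
  taints:
  - key: node.cloudprovider.kubernetes.io/uninitialized   # removed once a CCM initializes the node
    value: "true"
    effect: NoSchedule
  - key: node-role.kubernetes.io/control-plane            # listed in the CCM chart defaults quoted earlier
    effect: NoSchedule
  - key: node-role.kubernetes.io/etcd                     # NoExecute; not in the CCM chart's default tolerations
    effect: NoExecute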
c
I still don’t see any tolerations on there. I think you need to add tolerations to the daemonset spec. I suspect there’s a spot in the chart values for that?
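If the chart is managed by RKE2's helm-controller (as the helm-install-aws-cloud-controller-manager pod suggests), one way to add tolerations through chart values is a HelmChartConfig. A minimal sketch, assuming the release is named aws-cloud-controller-manager and that the chart exposes a top-level tolerations value as in the defaults quoted earlier:
Copy code
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: aws-cloud-controller-manager   # must match the HelmChart name
  namespace: kube-system
spec:
  valuesContent: |-
    tolerations:
    - key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"
      effect: NoSchedule
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule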
a
This is the only place where tolerations are not configured, according to the documentation.
@creamy-pencil-82913 One question. There is an aws-cloud-controller-manager in rkeConfig and also a Helm chart installation from the CLI. Do I need to install both? If yes, should the second one be installed on the RKE2 cluster or on the cluster I am creating, for example cluster-a?
@creamy-pencil-82913 I figured out the problem: I separated the etcd and control plane nodes. I remembered running into this when installing the k3s server to deploy the Rancher server. When using an external cloud provider, scheduling is blocked on nodes with the etcd role. The cluster has now been created successfully. I will carry out more tests.
Now other problems have been identified, when taking a snapshot and when trying to edit the cluster config.
About the snapshot ... same problem as reported in https://github.com/rancher/rancher/issues/45770
b
hello! I’m getting the same issue. Could you elaborate a bit more on this “blocking for nodes with role etcd”? Is this a design flaw?
a
I don't think it's a design flaw. I believe it is expected when you're using an external cloud provider, as in my case. I had a taint that blocked the cloud controller manager from launching. When I separated the node roles (etcd and control plane), everything worked.
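An alternative to separating the roles might be to give the CCM daemonset a toleration for the etcd taint through its chart values; the key and effect below match the tolerations already shown on the cattle-cluster-agent pod (sketch only):
Copy code
tolerations:
- key: node-role.kubernetes.io/etcd
  operator: Exists
  effect: NoExecute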
b
Got it, thanks for the reply!