# rke2
c
You need to add manifests so that the cloud-provider actually gets deployed when the cluster is built. The rancher cluster agent won’t come up and phone back in to Rancher until after the cloud-provider is functional.
d
you mean this section?
```yaml
spec:
  rkeConfig:
    additionalManifest: |-
      apiVersion: helm.cattle.io/v1
      kind: HelmChart
      metadata:
        name: aws-cloud-controller-manager
        namespace: kube-system
      spec:
        chart: aws-cloud-controller-manager
        repo: https://kubernetes.github.io/cloud-provider-aws
        targetNamespace: kube-system
        bootstrap: true
        valuesContent: |-
          hostNetworking: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: "true"
          args:
            - --configure-cloud-routes=false
            - --v=5
            - --cloud-provider=aws
```
c
yeah, something like that would do it
confirm that’s getting deployed and working properly
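a quick sanity check, assuming you can reach the cluster with kubectl (the pod label and job name below are guesses based on the upstream chart and the helm-controller's usual naming, adjust if yours differ):
```bash
# Did the HelmChart resource from additionalManifest get created?
kubectl -n kube-system get helmchart aws-cloud-controller-manager

# Did the helm-controller's install job complete?
kubectl -n kube-system get job helm-install-aws-cloud-controller-manager

# Are the CCM pods running, and what do their logs say?
kubectl -n kube-system get pods -l k8s-app=aws-cloud-controller-manager
kubectl -n kube-system logs -l k8s-app=aws-cloud-controller-manager --tail=50
```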
d
ok, I've tested that and it didn't work. I assumed I could deploy that later on. Let me try again.
c
no. the cloud provider is critical to the nodes actually coming up and going Ready
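for background: with the cloud provider set to external, the kubelet registers every node with the node.cloudprovider.kubernetes.io/uninitialized taint, and the node stays tainted until the cloud controller manager initializes it and removes the taint. A sketch of how to spot that, assuming kubectl access:
```bash
# Nodes waiting on the cloud provider carry the "uninitialized" taint
# until the CCM initializes them and strips it.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
# look for: node.cloudprovider.kubernetes.io/uninitialized
```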
d
So now the spec has this added:
```yaml
spec:
  enableNetworkPolicy: false
  kubernetesVersion: v1.31.2+rke2r1
  localClusterAuthEndpoint: {}
  rkeConfig:
    additionalManifest: |
      ---
      apiVersion: helm.cattle.io/v1
      kind: HelmChart
      metadata:
        name: aws-cloud-controller-manager
        namespace: kube-system
      spec:
        chart: aws-cloud-controller-manager
        repo: https://kubernetes.github.io/cloud-provider-aws
        targetNamespace: kube-system
        bootstrap: true
        valuesContent: |-
          hostNetworking: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: "true"
          args:
            - --configure-cloud-routes=false
            - --v=5
            - --cloud-provider=aws
```
but nodes are still stuck
c
right, did you look to see if it's working though?
Is the cloud provider pod running? Are there errors in its logs?
d
you mean directly on the control plane nodes?
c
?
Look at the pods that chart deployed. Are they working?
d
ok, so I'm confused. The k8s cluster is not ready yet, so how can I check if the pods are running?
c
How would you normally troubleshoot an error with a downstream cluster if you can’t get to it from Rancher?
d
I'd ssh to one of the nodes and run
docker ps
😉 However, the docker command is not found. So there must be another tool creating those pods? Note: I'm fairly new to Rancher!
c
RKE2 does not use docker. It uses containerd.
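if you want the containerd-level equivalent of docker ps, RKE2 bundles crictl; a sketch using the default RKE2 paths:
```bash
# RKE2's bundled crictl, pointed at its containerd socket via the
# config file RKE2 writes out on each node
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps
```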
When logged into a server node, you can use kubectl to interact with pods directly on the cluster as long as the apiserver is up. If it is not up, you can look at the logs under /var/log/pods on the individual nodes.
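on a server node that usually looks something like this (default RKE2 paths; the pod directory name is just an example):
```bash
# kubeconfig and kubectl as laid down by RKE2 on server nodes
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl -n kube-system get pods

# if the apiserver is down, fall back to the raw container logs;
# directories are named <namespace>_<pod>_<uid>
ls /var/log/pods/
tail -f /var/log/pods/kube-system_aws-cloud-controller-manager-*/*/*.log
```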
d
ok, so I see some logs that look interesting:
```
2025-01-09T20:06:10.449973613Z stderr F I0109 20:06:10.449861       1 node_controller.go:427] Initializing node testjs2-etcd-hqj29-xrwtr with cloud provider
2025-01-09T20:06:10.450135318Z stderr F I0109 20:06:10.449904       1 aws.go:5089] Unable to convert node name "testjs2-etcd-hqj29-xrwtr" to aws instanceID, fall back to findInstanceByNodeName: node has no providerID
```
and more logs:
```
2025-01-09T20:06:10.584975828Z stderr F I0109 20:06:10.584624       1 log_handler.go:27] AWS request: ec2 DescribeInstances
2025-01-09T20:06:10.684790209Z stderr F I0109 20:06:10.684610       1 log_handler.go:32] AWS API Send: ec2 DescribeInstances &{DescribeInstances POST / 0xc000317ef0 <nil>} {
2025-01-09T20:06:10.684809928Z stderr F   Filters: [{
2025-01-09T20:06:10.684837001Z stderr F       Name: "private-dns-name",
2025-01-09T20:06:10.684840284Z stderr F       Values: ["testjs2-etcd-hqj29-9rnhn"]
2025-01-09T20:06:10.684842604Z stderr F     },{
2025-01-09T20:06:10.684845064Z stderr F       Name: "instance-state-name",
2025-01-09T20:06:10.684847463Z stderr F       Values: [
2025-01-09T20:06:10.68485013Z stderr F         "pending",
2025-01-09T20:06:10.684852596Z stderr F         "running",
2025-01-09T20:06:10.684854898Z stderr F         "shutting-down",
2025-01-09T20:06:10.684857621Z stderr F         "stopping",
2025-01-09T20:06:10.684859873Z stderr F         "stopped"
2025-01-09T20:06:10.684863461Z stderr F       ]
2025-01-09T20:06:10.684866712Z stderr F     }],
2025-01-09T20:06:10.684870468Z stderr F   MaxResults: 1000
2025-01-09T20:06:10.684872984Z stderr F }
2025-01-09T20:06:10.684875747Z stderr F I0109 20:06:10.684665       1 log_handler.go:37] AWS API ValidateResponse: ec2 DescribeInstances &{DescribeInstances POST / 0xc000317ef0 <nil>} {
2025-01-09T20:06:10.684878114Z stderr F   Filters: [{
2025-01-09T20:06:10.684880735Z stderr F       Name: "private-dns-name",
2025-01-09T20:06:10.684883207Z stderr F       Values: ["testjs2-etcd-hqj29-9rnhn"]
2025-01-09T20:06:10.684887921Z stderr F     },{
2025-01-09T20:06:10.684890294Z stderr F       Name: "instance-state-name",
2025-01-09T20:06:10.68489264Z stderr F       Values: [
2025-01-09T20:06:10.684894956Z stderr F         "pending",
2025-01-09T20:06:10.684897158Z stderr F         "running",
2025-01-09T20:06:10.684899732Z stderr F         "shutting-down",
2025-01-09T20:06:10.684901936Z stderr F         "stopping",
2025-01-09T20:06:10.684904142Z stderr F         "stopped"
2025-01-09T20:06:10.684906328Z stderr F       ]
2025-01-09T20:06:10.684908558Z stderr F     }],
2025-01-09T20:06:10.68491089Z stderr F   MaxResults: 1000
2025-01-09T20:06:10.684913103Z stderr F } 200 OK
2025-01-09T20:06:10.684978458Z stderr F E0109 20:06:10.684868       1 node_controller.go:236] error syncing 'testjs2-etcd-hqj29-9rnhn': failed to get provider ID for node testjs2-etcd-hqj29-9rnhn at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
```
c
the cloud provider needs to be able to resolve the node name to an EC2 instance ID
You should upgrade to v1.31.3 or newer: https://github.com/rancher/rke2/issues/7344
The release of RKE2 you’re using does not set the node name properly for compatibility with AWS: https://github.com/rancher/rke2/pull/7354
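if you want to confirm the mismatch, compare the Kubernetes node name against what EC2 reports as the private DNS name; the CCM's fallback DescribeInstances call in your logs filters on private-dns-name, so the two must match. A sketch (the curl part runs on the EC2 instance itself, IMDSv2 token flow shown):
```bash
# Node names and whether they picked up a providerID yet
kubectl get nodes -o custom-columns='NAME:.metadata.name,PROVIDERID:.spec.providerID'

# On the instance: what the metadata service reports as the hostname
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/hostname
# the private-dns-name filter only matches when the node name equals this
```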
d
ok, changing that right away!
c
how’d you end up with 1.31.2?
d
from a Terraform variable; it's the previous version I was deploying.
ok, so the cluster deployment worked. I will test the AWS Load Balancer Controller (LBC) now.