# rke2
c
You need to add manifests so that the cloud-provider actually gets deployed when the cluster is built. The rancher cluster agent won’t come up and phone back in to Rancher until after the cloud-provider is functional.
d
you mean this section?
```yaml
spec:
  rkeConfig:
    additionalManifest: |-
      apiVersion: helm.cattle.io/v1
      kind: HelmChart
      metadata:
        name: aws-cloud-controller-manager
        namespace: kube-system
      spec:
        chart: aws-cloud-controller-manager
        repo: https://kubernetes.github.io/cloud-provider-aws
        targetNamespace: kube-system
        bootstrap: true
        valuesContent: |-
          hostNetworking: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: "true"
          args:
            - --configure-cloud-routes=false
            - --v=5
            - --cloud-provider=aws
```
c
yeah, something like that would do it
confirm that’s getting deployed and working properly
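a quick sanity check, assuming you can reach the cluster with kubectl (the pod label and job name below are guesses based on the upstream chart and the helm-controller's usual naming, adjust if yours differ):
```bash
# Did the HelmChart resource from additionalManifest get created?
kubectl -n kube-system get helmchart aws-cloud-controller-manager

# Did the helm-controller's install job complete?
kubectl -n kube-system get job helm-install-aws-cloud-controller-manager

# Are the CCM pods running, and what do their logs say?
kubectl -n kube-system get pods -l k8s-app=aws-cloud-controller-manager
kubectl -n kube-system logs -l k8s-app=aws-cloud-controller-manager --tail=50
```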
d
ok, I've tested that and it didn't work. I assumed I could deploy that later on. Let me try again.
c
no. the cloud provider is critical to the nodes actually coming up and going Ready
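for background: with the cloud provider set to external, the kubelet registers every node with the node.cloudprovider.kubernetes.io/uninitialized taint, and the node stays tainted until the cloud controller manager initializes it and removes the taint. A sketch of how to spot that, assuming kubectl access:
```bash
# Nodes waiting on the cloud provider carry the "uninitialized" taint
# until the CCM initializes them and strips it.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
# look for: node.cloudprovider.kubernetes.io/uninitialized
```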
d
So now the spec has this added:
```yaml
spec:
  enableNetworkPolicy: false
  kubernetesVersion: v1.31.2+rke2r1
  localClusterAuthEndpoint: {}
  rkeConfig:
    additionalManifest: |
      ---
      apiVersion: helm.cattle.io/v1
      kind: HelmChart
      metadata:
        name: aws-cloud-controller-manager
        namespace: kube-system
      spec:
        chart: aws-cloud-controller-manager
        repo: https://kubernetes.github.io/cloud-provider-aws
        targetNamespace: kube-system
        bootstrap: true
        valuesContent: |-
          hostNetworking: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: "true"
          args:
            - --configure-cloud-routes=false
            - --v=5
            - --cloud-provider=aws
```
but nodes are still stuck
c
right, did you look to see if it's working though?
Is the cloud provider pod running? Are there errors in its logs?
d
you mean directly on the control plane nodes?
c
?
Look at the pods that chart deployed. Are they working?
d
ok, so I'm confused. The k8s cluster is not ready yet, so how can I check if the pods are running?
c
How would you normally troubleshoot an error with a downstream cluster if you can’t get to it from Rancher?
d
I'd ssh to one of the nodes and run
docker ps
😉 However, the docker command is not found. So there must be another tool creating those pods? Note: I'm fairly new to Rancher!
c
RKE2 does not use docker. It uses containerd.
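if you want the containerd-level equivalent of docker ps, RKE2 bundles crictl; a sketch using the default RKE2 paths:
```bash
# RKE2's bundled crictl, pointed at its containerd socket via the
# config file RKE2 writes out on each node
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps
```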
When logged into a server node, you can use kubectl to interact with pods directly on the cluster as long as the apiserver is up. If it is not up, you can look at the logs under /var/log/pods on the individual nodes.
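on a server node that usually looks something like this (default RKE2 paths; the pod directory name is just an example):
```bash
# kubeconfig and kubectl as laid down by RKE2 on server nodes
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl -n kube-system get pods

# if the apiserver is down, fall back to the raw container logs;
# directories are named <namespace>_<pod>_<uid>
ls /var/log/pods/
tail -f /var/log/pods/kube-system_aws-cloud-controller-manager-*/*/*.log
```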
d
ok, so I see some logs that look interesting:
```
2025-01-09T20:06:10.449973613Z stderr F I0109 20:06:10.449861       1 node_controller.go:427] Initializing node testjs2-etcd-hqj29-xrwtr with cloud provider
2025-01-09T20:06:10.450135318Z stderr F I0109 20:06:10.449904       1 aws.go:5089] Unable to convert node name "testjs2-etcd-hqj29-xrwtr" to aws instanceID, fall back to findInstanceByNodeName: node has no providerID
```
and more logs:
```
2025-01-09T20:06:10.584975828Z stderr F I0109 20:06:10.584624       1 log_handler.go:27] AWS request: ec2 DescribeInstances
2025-01-09T20:06:10.684790209Z stderr F I0109 20:06:10.684610       1 log_handler.go:32] AWS API Send: ec2 DescribeInstances &{DescribeInstances POST / 0xc000317ef0 <nil>} {
2025-01-09T20:06:10.684809928Z stderr F   Filters: [{
2025-01-09T20:06:10.684837001Z stderr F       Name: "private-dns-name",
2025-01-09T20:06:10.684840284Z stderr F       Values: ["testjs2-etcd-hqj29-9rnhn"]
2025-01-09T20:06:10.684842604Z stderr F     },{
2025-01-09T20:06:10.684845064Z stderr F       Name: "instance-state-name",
2025-01-09T20:06:10.684847463Z stderr F       Values: [
2025-01-09T20:06:10.68485013Z stderr F         "pending",
2025-01-09T20:06:10.684852596Z stderr F         "running",
2025-01-09T20:06:10.684854898Z stderr F         "shutting-down",
2025-01-09T20:06:10.684857621Z stderr F         "stopping",
2025-01-09T20:06:10.684859873Z stderr F         "stopped"
2025-01-09T20:06:10.684863461Z stderr F       ]
2025-01-09T20:06:10.684866712Z stderr F     }],
2025-01-09T20:06:10.684870468Z stderr F   MaxResults: 1000
2025-01-09T20:06:10.684872984Z stderr F }
2025-01-09T20:06:10.684875747Z stderr F I0109 20:06:10.684665       1 log_handler.go:37] AWS API ValidateResponse: ec2 DescribeInstances &{DescribeInstances POST / 0xc000317ef0 <nil>} {
2025-01-09T20:06:10.684878114Z stderr F   Filters: [{
2025-01-09T20:06:10.684880735Z stderr F       Name: "private-dns-name",
2025-01-09T20:06:10.684883207Z stderr F       Values: ["testjs2-etcd-hqj29-9rnhn"]
2025-01-09T20:06:10.684887921Z stderr F     },{
2025-01-09T20:06:10.684890294Z stderr F       Name: "instance-state-name",
2025-01-09T20:06:10.68489264Z stderr F       Values: [
2025-01-09T20:06:10.684894956Z stderr F         "pending",
2025-01-09T20:06:10.684897158Z stderr F         "running",
2025-01-09T20:06:10.684899732Z stderr F         "shutting-down",
2025-01-09T20:06:10.684901936Z stderr F         "stopping",
2025-01-09T20:06:10.684904142Z stderr F         "stopped"
2025-01-09T20:06:10.684906328Z stderr F       ]
2025-01-09T20:06:10.684908558Z stderr F     }],
2025-01-09T20:06:10.68491089Z stderr F   MaxResults: 1000
2025-01-09T20:06:10.684913103Z stderr F } 200 OK
2025-01-09T20:06:10.684978458Z stderr F E0109 20:06:10.684868       1 node_controller.go:236] error syncing 'testjs2-etcd-hqj29-9rnhn': failed to get provider ID for node testjs2-etcd-hqj29-9rnhn at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
```
c
the cloud provider needs to be able to resolve the node name to an EC2 instance ID
You should upgrade to v1.31.3 or newer: https://github.com/rancher/rke2/issues/7344
The release of RKE2 you’re using does not set the node name properly for compatibility with AWS: https://github.com/rancher/rke2/pull/7354
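if you want to confirm the mismatch, compare the Kubernetes node name against what EC2 reports as the private DNS name; the CCM's fallback DescribeInstances call in your logs filters on private-dns-name, so the two must match. A sketch (the curl part runs on the EC2 instance itself, IMDSv2 token flow shown):
```bash
# Node names and whether they picked up a providerID yet
kubectl get nodes -o custom-columns='NAME:.metadata.name,PROVIDERID:.spec.providerID'

# On the instance: what the metadata service reports as the hostname
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/hostname
# the private-dns-name filter only matches when the node name equals this
```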
d
ok, changing that right away!
c
how’d you end up with 1.31.2?
d
from a Terraform variable; it's the previous version I was deploying.
ok, so the cluster deployment worked. I will test the AWS Load Balancer Controller (LBC) now.