# kubernetes
c
the cloud provider needs to be deployed during cluster provisioning, as the rancher cluster agent won’t be able to run until that’s done. That effectively means that you need to put EVERYTHING necessary to get the out-of-tree cloud provider working properly in a user manifest, so that it is installed during cluster bootstrap.
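for reference, something along these lines as an additional manifest is the general shape (untested sketch — the chart repo and values here are illustrative and version-dependent, so check the upstream cloud-provider-aws chart docs):
Copy code
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  # upstream chart for the out-of-tree AWS cloud provider
  repo: https://kubernetes.github.io/cloud-provider-aws
  chart: aws-cloud-controller-manager
  targetNamespace: kube-system
  bootstrap: true
  valuesContent: |-
    args:
      - --v=2
      - --cloud-provider=aws
      - --configure-cloud-routes=false
    nodeSelector:
      node-role.kubernetes.io/control-plane: "true"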
a
Ah, interesting. I added the cloud-controller in the additional-manifests section of the cluster config and thought I got all of the tags right. Running
crictl ps -a
on the controlplane shows the aws-cloud-controller-manager with a bunch of failed restarts; the last line in the log shows that I must have an old tag lying around somewhere
Copy code
Cloud provider could not be initialized: could not init cloud provider "aws": Found multiple cluster tags with prefix kubernetes.io/cluster/
c
ahh sounds like you’re close!
a
it looks like that's from trying to describe ec2 instances, so does that mean ANY instance in my VPC with multiple cluster tags will cause that problem? I assumed I'd just need to fix the instances I added to the cluster, but I guess it makes sense that it'd look at all of them
c
I’d defer to the upstream docs on how they expect things to be tagged
In my experience it is fine to have multiple clusters in the same VPC, you just need to make sure you set up the tags and the config to match, and that they're unique per cluster.
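the key thing is the kubernetes.io/cluster/<cluster-name> tag on the instances (and on any subnets/security groups the provider should use), with a value of owned or shared — roughly something like this, with your own cluster name and instance id (these are made up):
Copy code
aws ec2 create-tags \
  --resources i-1234567890abcdefg \
  --tags Key=kubernetes.io/cluster/my-cluster,Value=owned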
a
ok cool that makes sense, thank you. Aside from that, all of the steps in this section should be all I need? The only other thing I had to do differently was the hostname-override, because our naming conventions differ from hostname.ec2.internal
Copy code
machineGlobalConfig:
      cni: canal
      disable-kube-proxy: false
      etcd-expose-metrics: false
      kube-apiserver-arg:
        - cloud-provider=external
      kube-controller-manager-arg:
        - cloud-provider=external
      kube-proxy-arg:
        - '--hostname-override="$(hostname -f)"'
      kube-scheduler-arg: []
    machinePools: null
    machineSelectorConfig:
      - config:
          cloud-provider-name: aws
          kubelet-arg:
            - '--hostname-override="$(hostname -f)"'
I wish we could just configure it straight up with aws as the cloud provider but we can't get static AWS credentials in our environment
c
- '--hostname-override="$(hostname -f)"'
does that… work? I wouldn’t have expected shell expressions to be supported like that.
if it does, cool lol
yeahhhh I don’t think that’ll work
it doesn't get expanded, because there's no shell involved; the args are passed to the binary directly.
a
I have no idea, I tried getting this working in our RKE1 cluster about a year ago. I've been looking through my old tickets today and I noted in one of them that someone in this slack suggested doing that. Fingers crossed haha, I wasn't able to find good documentation on how to set those overrides. I see plenty of docs that say to do it for the kubelet on all nodes and for kube-proxy, but couldn't find the exact syntax anywhere. Actually I think it might not be working, or there's something else I have to set, because the cloud-controller-manager is working now that I cleaned up those three tags, but I'm seeing this in the log
Copy code
error syncing '<ip_address>.our.domain': failed to get provider ID for node <ip_address>.our.domain at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
c
Copy code
spec:
  containers:
  - args:
    - --cluster-cidr=10.42.0.0/16
    - --conntrack-max-per-core=0
    - --conntrack-tcp-timeout-close-wait=0s
    - --conntrack-tcp-timeout-established=0s
    - --healthz-bind-address=127.0.0.1
    - --hostname-override="$(hostname -f)"
    - --kubeconfig=/var/lib/rancher/rke2/agent/kubeproxy.kubeconfig
    - --proxy-mode=iptables
    command:
    - kube-proxy
I get this:
Copy code
root@rke2-server-1:/# cat /var/log/pods/kube-system_kube-proxy-rke2-server-1_e778576655760f9a1185c3599f8f57f0/kube-proxy/0.log
2023-12-19T22:38:09.257758437Z stderr F E1219 22:38:09.257613       1 server.go:1039] "Failed to retrieve node info" err="nodes \"\\\"$(hostname -f)\\\"\" not found"
a
bummer. Any idea on how to set that dynamically? Or does it need to go in each node's user data script?
c
You need to override it to match the DEFAULT naming scheme, not the actual naming scheme in use
When IP-based naming is used, the nodes must be named after the instance followed by the regional domain name (ip-xxx-xxx-xxx-xxx.ec2.<region>.internal). If you have a custom domain name set in the DHCP options, you must set --hostname-override on kube-proxy and kubelet to match the above-mentioned naming convention.
When resource based naming is used, the node must be named after the instance either with or without a domain name (i-1234567890abcdefg or i-1234567890abcdefg.<region>.compute.internal). A custom domain name, configured through DHCP options, may also be used.
so you need to override it to be one of those two formats, not the value returned by hostname
a
Oh I see, I guess I was reading that wrong.
I had been looking at that in the docs, but can't find an example anywhere of what exactly to put as the override. Unless I'm missing the obvious
c
it's right there
it needs to be overridden to the ip-xxx-xxx-xxx-xxx.ec2.<region>.internal or i-1234567890abcdefg.<region>.compute.internal naming scheme
a
ok, I guess that's where I'm confused. Then I just need to set it for each node when it joins the cluster? I can't put it in the cluster config without some shell commands and parsing of the hostname?
c
You could drop something like this into the image you’re using as a userdata script to set it up ahead of time:
Copy code
#!/bin/sh
TOKEN=`curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
REGION=`curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/region`
INSTANCE=`curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id`

echo "overriding hostname as ${INSTANCE}.${REGION}.compute.internal"

mkdir -p /etc/rancher/rke2/config.yaml.d
echo <<EOF >/etc/rancher/rke2/config.yaml.d/99-aws-id.yaml
  kubelet-arg+:
    - --hostname-override=${INSTANCE}.${REGION}.compute.internal
  kube-proxy-arg+:
    - --hostname-override=${INSTANCE}.${REGION}.compute.internal
EOF
a
awesome thank you!
c
I believe that should work
a
I'll give it a shot, you've been a huge help and cleared up a lot of stuff, thanks again
c
lol I wanted cat <<EOF not echo <<EOF of course, but otherwise it should work I think
a
Is there anywhere else I need to set the override? I tweaked it a bit so it's using the IP-based name instead of the resource-based name, but this is what the 99-aws-id.yaml looks like
Copy code
kubelet-arg+:
  - --hostname-override=<ip_address>.ec2.us-east-1.internal
kube-proxy-arg+:
  - --hostname-override=<ip_address>.ec2.us-east-1.internal
Now it hasn't even gotten to starting the cloud controller; the rke2-server.service log shows this error
Copy code
"Waiting for control-plane node <ip_address>.our.domain startup: nodes \"<ip_address>.our.domain\" not found"
So somewhere it's still trying to use the actual hostname of the node, with our custom domain
Do I need to add something to the registration command? I'm just using the command given by the UI under cluster registration when you select the role.
c
you’re not overriding the node name, just the hostname
is the kubelet running successfully?
a
yea but I mean somewhere it must be using the hostname as the nodename still. Do I need to set the node-name in a config file as well? https://docs.rke2.io/install/requirements#prerequisites
ps -ef | grep kubelet does show the kubelet running. It does show the hostname-override option in the ps output of the kubelet process, so that part definitely worked
c
the kubelet should be responsible for creating the node resource, I would check the kubelet logs and try to see why that’s not happening
note that you’re not overriding the node name, just the hostname field on the node resource. Check the kubelet logs and see what exactly it’s doing.
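on RKE2 the kubelet logs to a file on the node itself, something like this (path from memory, double-check on your install):
Copy code
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log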
a
Ah, yea, the kubelet log on the control plane is full of
Copy code
"Attempting to register node" node="<IP>.ec2.us-east-1.internal"
"Unable to register node with API server" err="nodes \"<IP>.ec2.us-east-1.internal\" is forbidden: node \"<IP>.our.domain\" is not allowed to modify node \"<IP>.ec2.us-east-1.internal"
"<IP>.our.domain is not allowd to modify node <IP>.ec2.us-east-1.internal" seems odd
c
you might want to also set
Copy code
node-name: <whatever>
see if setting the node name to the same as the overridden hostname helps
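i.e. if you're generating the dropin from the script above, something like this (same variables as before):
Copy code
node-name: ${INSTANCE}.${REGION}.compute.internal
kubelet-arg+:
  - --hostname-override=${INSTANCE}.${REGION}.compute.internal
kube-proxy-arg+:
  - --hostname-override=${INSTANCE}.${REGION}.compute.internal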
a
Cool, I'll add that to the script you gave me. Is that fixable on my nodes currently trying to register and provision, or do I need to kill it all and start over? No big deal, it only takes a couple minutes now that I've done it enough times haha
c
the servers in particular will be confused if you change that after the fact
a
ok, I figured as much
Ok, adding the node-name did the trick! Also, just to note for anyone else stumbling on this: I know you gave me the script for overriding the name with the resource-based name, and I changed it to IP-based. Since I'm in us-east-1, our internal DNS names are actually <ip_name>.ec2.internal; the name doesn't include the region in us-east-1, so I had to change that little part in order for it to work. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-naming.html Thanks again Brandon, this has stumped me for a very long time!
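Rough sketch of the adjusted userdata for anyone following along (untested as written; it builds the ip-a-b-c-d.ec2.internal name from the instance's private IP, which only matches us-east-1 — other regions use <name>.<region>.compute.internal per the doc quoted above):
Copy code
#!/bin/sh
# grab the instance's private IP from the metadata service (IMDSv2)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
IP=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/local-ipv4)

# us-east-1 IP-based naming: ip-<a-b-c-d>.ec2.internal (no region in the name)
NODE_NAME="ip-$(echo "$IP" | tr '.' '-').ec2.internal"

mkdir -p /etc/rancher/rke2/config.yaml.d
cat <<EOF >/etc/rancher/rke2/config.yaml.d/99-aws-id.yaml
node-name: ${NODE_NAME}
kubelet-arg+:
  - --hostname-override=${NODE_NAME}
kube-proxy-arg+:
  - --hostname-override=${NODE_NAME}
EOF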