# kubernetes
c
the cloud provider needs to be deployed during cluster provisioning, as the rancher cluster agent won’t be able to run until that’s done. That effectively means that you need to put EVERYTHING necessary to get the out-of-tree cloud provider working properly in a user manifest, so that it is installed during cluster bootstrap.
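for reference, something along these lines as an additional manifest is the general shape (untested sketch — the chart repo and values here are illustrative and version-dependent, so check the upstream cloud-provider-aws chart docs):
Copy code
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  # upstream chart for the out-of-tree AWS cloud provider
  repo: https://kubernetes.github.io/cloud-provider-aws
  chart: aws-cloud-controller-manager
  targetNamespace: kube-system
  bootstrap: true
  valuesContent: |-
    args:
      - --v=2
      - --cloud-provider=aws
      - --configure-cloud-routes=false
    nodeSelector:
      node-role.kubernetes.io/control-plane: "true"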
a
Ah, interesting. I added the cloud-controller in the additional-manifests section of the cluster config and thought I got all of the tags right. Running
crictl ps -a
on the controlplane shows the aws-cloud-controller-manager with a bunch of failed restarts; the last line in the log shows that I must have an old tag lying around somewhere
Copy code
Cloud provider could not be initialized: could not init cloud provider "aws": Found multiple cluster tags with prefix kubernetes.io/cluster/
c
ahh sounds like you’re close!
a
it looks like that's from trying to describe ec2 instances, so does that mean ANY instance in my VPC with multiple cluster tags will cause that problem? I assumed I'd just need to fix the instances I added to the cluster, but I guess it makes sense that it'd look at all of them
c
I’d defer to the upstream docs on how they expect things to be tagged
In my experience it is fine to have multiple clusters in the same VPC, you just need to make sure you set up the tags and the config to match, and that they're unique per cluster.
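the key thing is the kubernetes.io/cluster/<cluster-name> tag on the instances (and on any subnets/security groups the provider should use), with a value of owned or shared — roughly something like this, with your own cluster name and instance id (these are made up):
Copy code
aws ec2 create-tags \
  --resources i-1234567890abcdefg \
  --tags Key=kubernetes.io/cluster/my-cluster,Value=owned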
a
ok cool that makes sense, thank you. Aside from that, all of the steps in this section should be all I need? The only other thing I had to do differently was the hostname-override, because our naming conventions differ from hostname.ec2.internal
Copy code
machineGlobalConfig:
      cni: canal
      disable-kube-proxy: false
      etcd-expose-metrics: false
      kube-apiserver-arg:
        - cloud-provider=external
      kube-controller-manager-arg:
        - cloud-provider=external
      kube-proxy-arg:
        - '--hostname-override="$(hostname -f)"'
      kube-scheduler-arg: []
    machinePools: null
    machineSelectorConfig:
      - config:
          cloud-provider-name: aws
          kubelet-arg:
            - '--hostname-override="$(hostname -f)"'
I wish we could just configure it straight up with aws as the cloud provider but we can't get static AWS credentials in our environment
c
- '--hostname-override="$(hostname -f)"'
does that… work? I wouldn’t have expected shell expressions to be supported like that.
if it does, cool lol
yeahhhh I don’t think that’ll work
it doesn't get expanded, because there's no shell involved; the args are passed to the binary directly.
a
I have no idea, I tried getting this working in our RKE1 cluster about a year ago. I've been looking through my old tickets today and I noted in one of them that someone in this slack suggested doing that. Fingers crossed haha, I wasn't able to find good documentation on how to set those overrides. I see plenty of docs that say to do it for the kubelet on all nodes and for kube-proxy, but couldn't find the exact syntax anywhere. Actually I think it might not be working, or there's something else I have to set, because the cloud-controller-manager is working now that I cleaned up those three tags, but I'm seeing this in the log
Copy code
error syncing '<ip_address>.our.domain': failed to get provider ID for node <ip_address>.our.domain at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
c
Copy code
spec:
  containers:
  - args:
    - --cluster-cidr=10.42.0.0/16
    - --conntrack-max-per-core=0
    - --conntrack-tcp-timeout-close-wait=0s
    - --conntrack-tcp-timeout-established=0s
    - --healthz-bind-address=127.0.0.1
    - --hostname-override="$(hostname -f)"
    - --kubeconfig=/var/lib/rancher/rke2/agent/kubeproxy.kubeconfig
    - --proxy-mode=iptables
    command:
    - kube-proxy
I get this:
Copy code
root@rke2-server-1:/# cat /var/log/pods/kube-system_kube-proxy-rke2-server-1_e778576655760f9a1185c3599f8f57f0/kube-proxy/0.log
2023-12-19T22:38:09.257758437Z stderr F E1219 22:38:09.257613       1 server.go:1039] "Failed to retrieve node info" err="nodes \"\\\"$(hostname -f)\\\"\" not found"
a
bummer. Any idea on how to set that dynamically? Or does it need to go in each node's user data script?
c
You need to override it to match the DEFAULT naming scheme, not the actual naming scheme in use
When IP-based naming is used, the nodes must be named after the instance followed by the regional domain name (ip-xxx-xxx-xxx-xxx.ec2.<region>.internal). If you have a custom domain name set in the DHCP options, you must set --hostname-override on kube-proxy and kubelet to match the above-mentioned naming convention.
When resource based naming is used, the node must be named after the instance either with or without a domain name (i-1234567890abcdefg or i-1234567890abcdefg.<region>.compute.internal). A custom domain name, configured through DHCP options, may also be used.
so you need to override it to be one of those two formats, not the value returned by hostname
a
Oh I see, I guess I was reading that wrong.
I had been looking at that in the docs, but can't find an example anywhere of what exactly to put as the override. Unless I'm missing the obvious
c
it's right there
it needs to be overridden to the ip-xxx-xxx-xxx-xxx.ec2.<region>.internal or i-1234567890abcdefg.<region>.compute.internal naming scheme
a
ok, I guess that's where I'm confused. Then I just need to set it for each node when it joins the cluster? I can't put it in the cluster config without some shell commands and parsing of the hostname?
c
You could drop something like this into the image you’re using as a userdata script to set it up ahead of time:
Copy code
#!/bin/sh
TOKEN=`curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
REGION=`curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/region`
INSTANCE=`curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id`

echo "overriding hostname as ${INSTANCE}.${REGION}.compute.internal"

mkdir -p /etc/rancher/rke2/config.yaml.d
echo <<EOF >/etc/rancher/rke2/config.yaml.d/99-aws-id.yaml
  kubelet-arg+:
    - --hostname-override=${INSTANCE}.${REGION}.compute.internal
  kube-proxy-arg+:
    - --hostname-override=${INSTANCE}.${REGION}.compute.internal
EOF
a
awesome thank you!
c
I believe that should work
a
I'll give it a shot, you've been a huge help and cleared up a lot of stuff, thanks again
c
lol I wanted cat <<EOF not echo <<EOF of course, but otherwise it should work I think
a
Is there anywhere else I need to set the override? I tweaked it a bit so it's using the IP-based name instead of the resource-based name, but this is what the 99-aws-id.yaml looks like
Copy code
kubelet-arg+:
  - --hostname-override=<ip_address>.ec2.us-east-1.internal
kube-proxy-arg+:
  - --hostname-override=<ip_address>.ec2.us-east-1.internal
Now it hasn't even gotten to starting the cloud controller; the rke2-server.service log shows this error
Copy code
"Waiting for control-plane node <ip_address>.our.domain startup: nodes \"<ip_address>.our.domain\" not found"
So somewhere it's still trying to use the actual hostname of the node, with our custom domain
Do I need to add something to the registration command? I'm just using the command given by the UI under cluster registration when you select the role.
c
you’re not overriding the node name, just the hostname
is the kubelet running successfully?
a
yea but I mean somewhere it must be using the hostname as the nodename still. Do I need to set the node-name in a config file as well? https://docs.rke2.io/install/requirements#prerequisites
ps -ef | grep kubelet does show the kubelet running. It does show the hostname-override option in the ps output of the kubelet process, so that part definitely worked
c
the kubelet should be responsible for creating the node resource, I would check the kubelet logs and try to see why that’s not happening
note that you’re not overriding the node name, just the hostname field on the node resource. Check the kubelet logs and see what exactly it’s doing.
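on RKE2 the kubelet logs to a file on the node itself, something like this (path from memory, double-check on your install):
Copy code
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log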
a
Ah, yea, the kubelet log on the control plane is full of
Copy code
"Attempting to register node" node="<IP>.ec2.us-east-1.internal"
"Unable to register node with API server" err="nodes \"<IP>.ec2.us-east-1.internal\" is forbidden: node \"<IP>.our.domain\" is not allowed to modify node \"<IP>.ec2.us-east-1.internal"
"<IP>.our.domain is not allowd to modify node <IP>.ec2.us-east-1.internal" seems odd
c
you might want to also set
Copy code
node-name: <whatever>
see if setting the node name to the same as the overridden hostname helps
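i.e. if you're generating the dropin from the script above, something like this (same variables as before):
Copy code
node-name: ${INSTANCE}.${REGION}.compute.internal
kubelet-arg+:
  - --hostname-override=${INSTANCE}.${REGION}.compute.internal
kube-proxy-arg+:
  - --hostname-override=${INSTANCE}.${REGION}.compute.internal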
a
Cool, I'll add that to the script you gave me. Is that fixable on my nodes currently trying to register and provision, or do I need to kill it all and start over? No big deal, it only takes a couple minutes now that I've done it enough times haha
c
the servers in particular will be confused if you change that after the fact
a
ok, I figured as much
Ok, adding the node-name did the trick! Also, just to note for anyone else stumbling on this: I know you gave me the script for overriding the name with the resource-based name, and I changed it to IP-based. Since I'm in us-east-1, our internal DNS names are actually <ip_name>.ec2.internal; the name doesn't include the region in us-east-1, so I had to change that little part in order for it to work. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-naming.html Thanks again Brandon, this has stumped me for a very long time!
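Rough sketch of the adjusted userdata for anyone following along (untested as written; it builds the ip-a-b-c-d.ec2.internal name from the instance's private IP, which only matches us-east-1 — other regions use <name>.<region>.compute.internal per the doc quoted above):
Copy code
#!/bin/sh
# grab the instance's private IP from the metadata service (IMDSv2)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
IP=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/local-ipv4)

# us-east-1 IP-based naming: ip-<a-b-c-d>.ec2.internal (no region in the name)
NODE_NAME="ip-$(echo "$IP" | tr '.' '-').ec2.internal"

mkdir -p /etc/rancher/rke2/config.yaml.d
cat <<EOF >/etc/rancher/rke2/config.yaml.d/99-aws-id.yaml
node-name: ${NODE_NAME}
kubelet-arg+:
  - --hostname-override=${NODE_NAME}
kube-proxy-arg+:
  - --hostname-override=${NODE_NAME}
EOF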