Accidentally put this in General - <https://ranche...
# rke2
a
c
That log message should be showing the node name (as it would show in `kubectl get nodes`), not the IP. Did something happen that caused the node name to change?
a
Sorry, yes it's the node name, our nodes are called ip-xxx-xxx-xxx-xxx.domain
c
Is the name that it says it's waiting for the same as the node name in `kubectl get nodes`?
a
actually no, in `kubectl get nodes` it has ip-xxx-xxx-xxx-xxx.us-iso-east-1.compute.internal. But I have that set in `/etc/rancher/rke2/config.yaml.d/99-aws-id.yaml` with
```yaml
kubelet-arg+:
  - --hostname-override=ip-xxx-xxx-xxx-xxx.us-iso-east-1.compute.internal
kube-proxy-arg+:
  - --hostname-override=ip-xxx-xxx-xxx-xxx.us-iso-east-1.compute.internal
node-name: ip-xxx-xxx-xxx-xxx.us-iso-east-1.compute.internal
node-label+:
  - node-type=controlplane
```
c
uhhh yeah don't do that
that is what the `node-name: xxx` option is for. If you just go poking at the hostname override in individual component args, rke2 itself will not be aware of that.
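For reference, a config that relies only on `node-name` (no per-component hostname-override args) might look roughly like the sketch below. The filename is the one mentioned above; the FQDN is a placeholder, not taken from this cluster.
```yaml
# /etc/rancher/rke2/config.yaml.d/99-aws-id.yaml -- illustrative sketch only
# node-name is the rke2-level setting for the node's registered name (what is
# being recommended above), instead of overriding hostnames per component.
node-name: ip-10-0-0-1.us-iso-east-1.compute.internal   # placeholder FQDN
node-label+:
  - node-type=controlplane
```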
or wait, I am confused. Is that indentation how you have it? I misread it because node-name and node-label are indented when they should not be.
a
hmm... this has always worked though, it was fine going from 1.27 to 1.28, and it works in a new 1.32 cluster I built. I think you might have pointed me towards that a couple years ago when I was struggling to get the AWS cloud controller working
no, ignore the formatting, I'm transcribing by hand across networks; that's all in an air-gapped network
c
I do suspect it has something to do with your node name and hostname override settings.
a
ok, I'll double check the formatting and all that. My last day on this job is tomorrow, I was just trying to run people through the rke2 upgrade process real quick to make sure my documentation was correct!
Brand-new control planes join the cluster just fine with the way I have it formatted. That's been in our Terraform for years. This only happened with this specific upgrade 🤷
c
is that all that you have set in your config?
so just to be clear, you have node-name set to `ip-xxx-xxx-xxx-xxx.us-iso-east-1.compute.internal` but the log says it is looking for `ip-xxx-xxx-xxx-xxx` without the FQDN?
a
Yep, that's all that I have in my config. I'm trying the upgrade in our non-air-gapped network now and getting the same thing. Straight copy/paste here from `/etc/rancher/rke2/config.yaml.d/99-aws-id.yaml`:
```yaml
kubelet-arg+:
  - --hostname-override=ip-xxx-xxx-xxx-xxx.ec2.internal
kube-proxy-arg+:
  - --hostname-override=ip-xxx-xxx-xxx-xxx.ec2.internal
node-name: ip-xxx-xxx-xxx-xxx.ec2.internal
```
This is from the rke2-server log on the control plane that's trying to upgrade:
```
rke2[27169]: time="2025-07-10T14:05:26Z" level=info msg="Waiting for control-plane node ip-xxx-xxx-xxx-xxx.domain.org startup: nodes \"ip-xxx-xxx-xxx-xxx.domain.org\" not found"
```
This is going from 1.28.15 to 1.29.15. Our cloud controller manager was still at 1.27.x from our initial install; I didn't upgrade it way back when I upgraded from 1.27.x to 1.28.15. First thing I did here was upgrade the cloud controller manager to 1.28.11, then I just added the plan to the SUC:
```yaml
# Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: controlplane-plan-v1-29-15
  namespace: cattle-system
  labels:
    rke2-upgrade: controlplane
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"]}
  tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Equal"
    effect: "NoSchedule"
  - key: "CriticalAddonsOnly"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
  serviceAccountName: system-upgrade-controller
  cordon: true
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.29.15+rke2r1
```
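A companion agent/worker plan typically sits alongside a server plan like this. The sketch below is modeled on the upstream RKE2 / system-upgrade-controller docs example and is illustrative only; the plan name, concurrency, and selector are assumptions, not taken from this cluster.
```yaml
# Illustrative agent plan (assumed names; adjust namespace/serviceAccount to
# match your SUC install). The prepare step waits for the server plan to
# complete before worker nodes are upgraded.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan-v1-29-15
  namespace: cattle-system
  labels:
    rke2-upgrade: agent
spec:
  concurrency: 2
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: NotIn, values: ["true"]}
  prepare:
    args:
      - prepare
      - controlplane-plan-v1-29-15
    image: rancher/rke2-upgrade
  serviceAccountName: system-upgrade-controller
  cordon: true
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.29.15+rke2r1
```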
Hmmm, I see this in the kubelet log on that node:
```
I0710 14:18:01.299173   27479 status_manager.go:877] "Failed to update status for pod" pod="kube-system/kube-proxy-ip-xxx-xxx-xxx-xxx.ec2.internal" err="failed to patch status \"{\\\"metadata\\\":{\\\"uid\\\":\\\"a1c6fb9d-34b3-45a5-9adf-d50451828562\\\"},\\\"status\\\":{\\\"$setElementOrder/conditions\\\":[{\\\"type\\\":\\\"PodReadyToStartContainers\\\"},{\\\"type\\\":\\\"Initialized\\\"},{\\\"type\\\":\\\"Ready\\\"},{\\\"type\\\":\\\"ContainersReady\\\"},{\\\"type\\\":\\\"PodScheduled\\\"}],\\\"conditions\\\":[{\\\"lastProbeTime\\\":null,\\\"lastTransitionTime\\\":\\\"2025-07-10T14:04:31Z\\\",\\\"status\\\":\\\"True\\\",\\\"type\\\":\\\"PodReadyToStartContainers\\\"},{\\\"lastTransitionTime\\\":\\\"2025-07-10T14:04:49Z\\\",\\\"status\\\":\\\"True\\\",\\\"type\\\":\\\"Ready\\\"},{\\\"lastTransitionTime\\\":\\\"2025-07-10T14:04:49Z\\\",\\\"type\\\":\\\"ContainersReady\\\"}],\\\"containerStatuses\\\":[{\\\"containerID\\\":\\\"containerd://ca9ff807b8758ff432cb1d5b355dc79259311198edad8f4de046885f376b46d5\\\",\\\"image\\\":\\\"docker-remote.artifactory.domain.org/rancher/hardened-kubernetes:v1.29.15-rke2r1-build20250312\\\",\\\"imageID\\\":\\\"docker-remote.artifactory.domain.org/rancher/hardened-kubernetes@sha256:34aaaf8700ef979929c3b1dbfb2d8de2b25c00a68a6a6b540293d6f576cb89fd\\\",\\\"lastState\\\":{},\\\"name\\\":\\\"kube-proxy\\\",\\\"ready\\\":true,\\\"restartCount\\\":0,\\\"started\\\":true,\\\"state\\\":{\\\"running\\\":{\\\"startedAt\\\":\\\"2025-07-10T14:04:30Z\\\"}}}],\\\"hostIPs\\\":[{\\\"ip\\\":\\\"10.114.49.20\\\"}]}}\" for pod \"kube-system\"/\"kube-proxy-ip-xxx-xxx-xxx-xxx.ec2.internal\": pods \"kube-proxy-ip-xxx-xxx-xxx-xxx.ec2.internal\" is forbidden: node \"ip-xxx-xxx-xxx-xxx.domain.org\" can only update pod status for pods with spec.nodeName set to itself"
```
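The last clause of that error is the interesting part: it looks like the NodeRestriction admission check, which only lets a node patch status for pods whose spec.nodeName matches the node identity it authenticated with. A rough sketch of the mismatch, using the placeholder names from the log:
```yaml
# Illustrative sketch only: the kube-proxy static pod is bound to the node
# name the kubelet registered with (the ec2.internal name)...
apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy-ip-xxx-xxx-xxx-xxx.ec2.internal
  namespace: kube-system
spec:
  nodeName: ip-xxx-xxx-xxx-xxx.ec2.internal
# ...while the node identity in the error is ip-xxx-xxx-xxx-xxx.domain.org,
# so the apiserver refuses the status patch from that node's credentials.
```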
Ok, I think this was a bug with rke2 v1.29.15. Just for a sanity check I tried upgrading from 1.28.15 to 1.29.9 (just picked a patch version at random) and that worked with the exact same config
I'm not going to bother putting in a ticket since this is a relatively old version of RKE2 at this point, unless you'd like me to
c
Hmm, that would not be any bug I'm aware of. If you go to 1.29.15 after 1.29.9, does it work OK? But yeah, it would not get fixed either way; 1.29 has been EOL for a while and is not getting any more releases.
a
Good question, we'll have to give that a shot and see if it makes a difference. We'll want to get as close as we can to the latest 1.29 patch version anyway when we upgrade that cluster to 1.30