# cluster-api
m
I'm facing an issue on a single-node cluster using `spec.agentConfig.additionalUserData.data`. The node gets deleted as soon as it comes up due to `Rolling 1 replicas with outdated spec (0 replicas up to date)`. I suspect this is because the formatting of the additional user data changes slightly from the `rke2controlplane` to the `rke2config` due to the way it's rendered. I have `maxSurge` set to 0. Does it seem like this could be the cause?
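
For illustration only, this is the kind of difference I mean (the field names and content below are placeholders, not the real manifests): the same user data can be serialized as a block scalar in one object and as a quoted string with escaped newlines in the other, which is enough for the comparison to treat the spec as changed.

```yaml
# Hypothetical illustration; placeholder field names and content.
# Rendering 1: user data authored as a YAML block scalar.
userData: |
  #cloud-config
  runcmd:
    - echo hello
---
# Rendering 2: the same content re-serialized as a quoted string with escaped
# newlines. Semantically equivalent, but a byte-for-byte comparison of the two
# specs sees a difference and triggers a rollout.
userData: "#cloud-config\nruncmd:\n  - echo hello\n"
```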
w
In single-node clusters, setting `maxSurge: 0` means the current node must be deleted before a new one can be provisioned. If `spec.agentConfig.additionalUserData.data` is even slightly different in formatting between the `rke2controlplane` and the rendered `rke2config` (e.g., due to whitespace or quote style), it triggers a change in the machine spec. Since the cluster sees it as an outdated spec and no extra node can be created (because of `maxSurge: 0`), it deletes the only node, taking the cluster down.

Recommendation: temporarily set `maxSurge: 1` and `maxUnavailable: 0` so the new node can come up before the old one is removed. That prevents the destructive behavior and lets you confirm whether the issue is caused by spec drift.
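
As a rough sketch of that change (assuming the machines are managed by a Cluster API `MachineDeployment`; control-plane objects such as `RKE2ControlPlane` typically expose only a `maxSurge` knob under `rolloutStrategy`, so the exact field path may differ in your setup):

```yaml
# Sketch only: rolling-update settings on a Cluster API MachineDeployment.
# With maxSurge: 1 and maxUnavailable: 0, the replacement node is created and
# becomes ready before the old node is deleted, so a single-node cluster never
# drops to zero nodes during a rollout.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-machines            # placeholder name
spec:
  # ...clusterName, replicas, and template omitted...
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```

Once the rollout completes and the drift is resolved, you can set `maxSurge` back to 0 if you cannot afford the temporary extra node during rollouts.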
m
Thank you, I will try that. I paused machine reconciliation immediately after the node came up to prevent deletion, and I could see slight whitespace differences between the rendered `rke2config` and the `rke2controlplane`. What I don't understand is why there are differences only the first time it renders the config for the initial node. Maybe a different code path?
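
For anyone hitting the same thing, a minimal sketch of pausing reconciliation so the machine can be inspected (placeholder name; this uses the standard Cluster API `spec.paused` field, which makes the controllers skip objects belonging to the paused Cluster):

```yaml
# Sketch: pause reconciliation for the whole cluster so machines are not
# rolled or deleted while you inspect the rendered configs. Set spec.paused
# back to false (or remove it) to resume.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster            # placeholder name
spec:
  paused: true
```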