For the admin cluster, I'm wondering whether replacing out-of-date nodes with new ones built from an up-to-date template is a good idea.
💯 that would be the most "cattle not pets" way of doing it. ephemeral nodes will make your life a million times easier than trying to manage a bunch of individual linux nodes. SUSE also sells a product called SUMA (Open Source: Uyuni) if the full blown node replacements don't work... hope this helps 🙂
Yes, it helps, but I still don't know how to handle this cattle approach. If we have an admin cluster with 3 nodes (each acting as worker, cp & etcd), how should we proceed? Add 3 new nodes to the admin cluster and, once they are registered, remove the 3 old nodes? Or add/remove 1 node at a time?
both methods should work alright - mostly it's just down to preference. if you do it all at once you might have a blip of downtime as the jobs will all need to be rescheduled. what is your tolerance for an outage?
So I normally recommend what we call node rehydration which means that you don't change nodes in-place, you build new nodes and remove the old
That of course is easier said than done, but it's one of those things where the hard work you put in today pays off with an easier tomorrow.
But if replacing nodes is not an option, I have some scripts for doing rolling OS upgrades.
Note: I built this for my lab so it does things like sleep between nodes for 900 seconds to avoid pods moving around like crazy during patching and to allow longhorn to fully rebuild between nodes.

#!/bin/bash

# Minimal usage message (placeholder text).
help() {
  echo "Usage: $0 -c <cluster-name>"
}

while getopts ":c:h" opt; do
  case $opt in
    c)
      CLUSTER="${OPTARG}"
      ;;
    h)
      help && exit 0
      ;;
    :)
      echo "Option -$OPTARG requires an argument."
      exit 1
      ;;
    *)
      help && exit 0
      ;;
  esac
done

if [[ -z "${CLUSTER}" ]]; then
  echo "Please specify a cluster name."
  exit 1
fi

export KUBECONFIG=~/.kube/${CLUSTER}
cd ~/scripts/rolling-patching/ || exit 1

# Block until the node answers over SSH again.
check_ssh() {
  echo "Checking ${server}"
  until ssh -q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@${server} 'uptime' > /dev/null
  do
    echo "Trying again..."
    sleep 1
  done
}

echo "Starting patching..."
for server in $(kubectl get nodes -o name | awk -F '/' '{print $2}')
do
  if ping -c 1 "$server"
  then
    echo "Server is pingable..."
    scp -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ./90forceyes root@${server}:/etc/apt/apt.conf.d/
    scp -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ./release-upgrades root@${server}:/etc/update-manager/release-upgrades
    echo "Draining node..."
    kubectl --kubeconfig "${KUBECONFIG}" drain --delete-emptydir-data --ignore-daemonsets ${server}
    echo "Running apt update and upgrade"
    ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@${server} 'export DEBIAN_FRONTEND=noninteractive; apt-get update && apt-get -o Dpkg::Options::="--force-confold" -o Dpkg::Options::=--force-confdef upgrade -q -y --allow-downgrades --allow-remove-essential --allow-change-held-packages && apt-get -o Dpkg::Options::="--force-confold" -o Dpkg::Options::=--force-confdef dist-upgrade -q -y --allow-downgrades --allow-remove-essential --allow-change-held-packages && reboot'
    sleep 60
    # Wait until the node is reachable over SSH again after the reboot.
    check_ssh
    echo "Running do-release-upgrade"
    ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@${server} 'export DEBIAN_FRONTEND=noninteractive; do-release-upgrade -f DistUpgradeViewNonInteractive; reboot'
    sleep 450
    check_ssh
    echo "Uncordon node..."
    kubectl --kubeconfig "${KUBECONFIG}" uncordon ${server}
    echo "Sleeping..."
    sleep 900
  else
    echo "Skipping..."
  fi
done
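The fixed sleeps in the script above are tuned for a lab. One possible alternative, sketched here under assumptions, is to block until the node reports Ready again with `kubectl wait`. The function name `wait_for_ready`, the `KUBECTL` override variable, and the `600s` default timeout are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: wait for a node to report Ready instead of sleeping a fixed time.
# KUBECTL can be overridden (for example KUBECTL=echo) to dry-run without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

wait_for_ready() {
  local node="$1"
  local timeout="${2:-600s}"   # assumed default; tune for your environment
  ${KUBECTL} wait --for=condition=Ready "node/${node}" --timeout="${timeout}"
}
```

A call such as `wait_for_ready "${server}"` could go before the uncordon step. Note that a Ready node does not mean Longhorn has finished rebuilding its replicas, so some pause between nodes may still be needed.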
@happy-wire-88980 why is replacing nodes not an option in my case? Is it because I only have 3 nodes in my admin cluster? If that's the issue, what's the correct admin cluster setup for node rehydration? Since I'm working with the rke terraform provider and VM templates, I'm pretty flexible and can deploy any number of nodes and integrate them into my admin cluster.
So the process I normally follow for the local cluster (3 nodes, all roles all nodes) is to add 3 new nodes then remove the old nodes one at a time.
We don't want to remove all 3 old nodes at once because that can lead to a split brain in etcd
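The reasoning behind removing old nodes one at a time can be checked with a little arithmetic: etcd needs a majority of its members (floor(n/2) + 1) to keep accepting writes. The sketch below assumes the 3 old + 3 new layout described above; the helper names `quorum` and `has_quorum` are made up for illustration.

```shell
#!/bin/bash
# etcd needs a majority of members to agree: quorum = floor(n / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum MEMBERS HEALTHY: succeeds if the healthy members still form a majority.
has_quorum() {
  [ "$2" -ge "$(quorum "$1")" ]
}

# 3 old + 3 new nodes = 6 members, so quorum is 4.
# Taking all 3 old nodes away at once leaves only 3 healthy members.
if has_quorum 6 3; then
  echo "6 members, 3 healthy: quorum kept"
else
  echo "6 members, 3 healthy: quorum lost"
fi

# Removing one old node cleanly shrinks the cluster to 5 members (quorum 3),
# and all 5 remaining members are healthy.
if has_quorum 5 5; then
  echo "5 members, 5 healthy: quorum kept"
else
  echo "5 members, 5 healthy: quorum lost"
fi
```

Removing nodes one by one lets the member count and the quorum shrink together, so the cluster never drops below a majority.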
OK thanks @happy-wire-88980 and @bulky-sunset-52084 for your help, I'll try this approach.