plain-refrigerator-80586

09/14/2022, 3:12 PM
Hello, I have a question related to the Rancher k8s upgrades master class hosted by @happy-wire-88980 https://github.com/mattmattox/Kubernetes-Master-Class/tree/main/rancher-k8s-upgrades The upgrade of every component was covered except for the OS upgrade. How do you manage those? The easiest way would be to upgrade one node at a time: cordon/drain -> update -> reboot -> uncordon, but this is a long and tedious process. For the admin cluster, I'm wondering if replacing out-of-date nodes with new ones based on an up-to-date template is a good idea. For the downstream clusters, creating a new cluster with up-to-date nodes and moving the workload from one cluster to the other. This offers a way to move everything back to the previous cluster if something goes wrong. Any thoughts about this, or some pointers on how to handle OS updates for Rancher and k8s nodes?
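That cordon/drain -> update -> reboot -> uncordon cycle, sketched as a shell function (a minimal sketch, assuming passwordless root SSH to Debian/Ubuntu nodes; `patch_node` is an illustrative name, not something from the master class):

```bash
# Sketch of one node's OS-patch cycle; root SSH access is an assumption.
patch_node() {
  local node="$1"
  kubectl cordon "${node}"
  kubectl drain "${node}" --ignore-daemonsets --delete-emptydir-data
  # Patch and reboot the node over SSH.
  ssh "root@${node}" 'apt-get update && apt-get -y dist-upgrade && reboot'
  # Wait for the kubelet to report Ready again before putting pods back.
  until kubectl get node "${node}" --no-headers | grep -qw Ready; do
    sleep 10
  done
  kubectl uncordon "${node}"
}
```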
bulky-sunset-52084

09/14/2022, 5:54 PM
For the admin cluster, I'm wondering if replacing out-of-date nodes with new ones based on an up-to-date template is a good idea.
💯 that would be the most "cattle, not pets" way of doing it. Ephemeral nodes will make your life a million times easier than trying to manage a bunch of individual Linux nodes. SUSE also sells a product called SUMA (open source: Uyuni) if full-blown node replacement doesn't work for you... hope this helps 🙂
plain-refrigerator-80586

09/15/2022, 6:39 AM
Yes, it helps, but I still don't know how to handle this cattle approach. If we have an admin cluster with 3 nodes (each acting as worker, cp & etcd), how should we proceed? Adding 3 new nodes to the admin cluster and, after they are registered, removing the 3 old nodes? Or adding/removing 1 node at a time?
bulky-sunset-52084

09/15/2022, 10:23 PM
Both methods should work alright - mostly it's just down to preference. If you do it all at once, you might have a blip of downtime as the jobs will all need to be rescheduled. What is your tolerance for an outage?
happy-wire-88980

09/15/2022, 10:25 PM
So I normally recommend what we call node rehydration, which means you don't change nodes in place; you build new nodes and remove the old ones.
That is of course easier said than done, but it's one of those things where you put in the hard work today for an easier tomorrow.
But if replacing nodes is not an option, I have some scripts for doing rolling OS upgrades.
Note: I built this for my lab, so it does things like sleep for 900 seconds between nodes to avoid pods moving around like crazy during patching and to allow Longhorn to fully rebuild between nodes.
#!/bin/bash

# Print usage information.
help() {
  echo "Usage: $(basename "$0") -c <cluster name>"
}

while getopts "c:h" opt; do
  case $opt in
    c)
      CLUSTER="${OPTARG}"
      ;;
    h)
      help && exit 0
      ;;
    :)
      echo "Option -$OPTARG requires an argument."
      exit 1
      ;;
    *)
      help && exit 1
      ;;
  esac
done

if [[ -z "${CLUSTER}" ]]; then
  echo "Please specify a cluster name."
  exit 1
fi

export KUBECONFIG=~/.kube/${CLUSTER}
kubeconfig=~/.kube/${CLUSTER}
cd ~/scripts/rolling-patching/

check_ssh() {
  echo "Checking ${server}"
  until ssh -q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@${server} 'uptime' > /dev/null
  do
    echo "Trying again..."
    sleep 1
  done
}

echo "Starting patching..."
for server in $(kubectl get nodes -o name | awk -F '/' '{print $2}')
do
  if ping -c 1 "${server}"
  then
    echo "Server is pingable..."
    scp -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ./90forceyes root@${server}:/etc/apt/apt.conf.d/
    scp -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ./release-upgrades root@${server}:/etc/update-manager/release-upgrades
    echo "Draining node..."
    kubectl --kubeconfig ${kubeconfig} drain --delete-emptydir-data --ignore-daemonsets ${server}
    echo "Running apt update and upgrade"
    ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@${server} 'export DEBIAN_FRONTEND=noninteractive; apt-get update && apt-get -o Dpkg::Options::="--force-confold" -o Dpkg::Options::=--force-confdef upgrade -q -y --allow-downgrades --allow-remove-essential --allow-change-held-packages && apt-get -o Dpkg::Options::="--force-confold" -o Dpkg::Options::=--force-confdef dist-upgrade -q -y --allow-downgrades --allow-remove-essential --allow-change-held-packages && reboot'
    sleep 60
    check_ssh
    echo "Running do-release-upgrade"
    ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@${server} 'export DEBIAN_FRONTEND=noninteractive; do-release-upgrade -f DistUpgradeViewNonInteractive; reboot'
    sleep 450
    echo "Uncordon node..."
    kubectl --kubeconfig ${kubeconfig} uncordon ${server}
    echo "Sleeping..."
    sleep 900
  else
    echo "Skipping..."
  fi
done
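If the script above is saved as, say, rolling-patch.sh (the filename is my own, for illustration), a run looks like this; the -c argument must match a kubeconfig file under ~/.kube/:

```bash
# Hypothetical invocation: patches every node of the cluster whose
# kubeconfig lives at ~/.kube/prod, one node at a time.
./rolling-patch.sh -c prod
```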
plain-refrigerator-80586

09/16/2022, 7:00 AM
@happy-wire-88980 why is replacing nodes not an option in my case? Is it because I only have 3 nodes in my admin cluster? If that's the issue, what's the correct admin cluster setup for node rehydration? Since I'm working with the RKE Terraform provider and VM templates, I'm pretty flexible and can deploy any number of nodes and add them to my admin cluster.
happy-wire-88980

09/16/2022, 7:01 AM
So the process I normally follow for the local cluster (3 nodes, all roles all nodes) is to add 3 new nodes then remove the old nodes one at a time.
We don't want to remove all 3 old nodes at once because that can break etcd quorum (a majority of members must stay up) and lead to a split brain
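The quorum arithmetic behind that, as a quick sketch (the `quorum` helper is purely illustrative): etcd stays writable only while a majority of members, floor(n/2) + 1, are up.

```bash
# etcd quorum: a majority of members, floor(n/2) + 1, must stay up.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # -> 2: a 3-member cluster tolerates losing just 1 node
quorum 6   # -> 4: with 3 old + 3 new members you can lose 2,
           #       hence removing the old nodes one at a time
```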
plain-refrigerator-80586

09/16/2022, 7:29 AM
OK thanks @happy-wire-88980 and @bulky-sunset-52084 for your help, I'll try this approach.