plain-refrigerator-80586

09/14/2022, 3:12 PM
Hello, I have a question related to the Rancher k8s upgrades master class hosted by @happy-wire-88980 https://github.com/mattmattox/Kubernetes-Master-Class/tree/main/rancher-k8s-upgrades The upgrade of every component was covered except for the OS upgrade. How do you manage those? The easiest way would be to upgrade one node at a time: cordon/drain -> update -> reboot -> uncordon, but this is a long and tedious process. For the admin cluster, I'm wondering if replacing out-of-date nodes with new ones based on an up-to-date template is a good idea. For the downstream clusters, creating a new cluster with up-to-date nodes and moving the workload from one cluster to the other. This offers a way to move everything back to the previous cluster if something goes wrong. Any thoughts about this, or some pointers on how to handle OS updates for Rancher and k8s nodes?
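That cordon/drain -> update -> reboot -> uncordon cycle, sketched as a shell function (a minimal sketch, assuming passwordless root SSH to Debian/Ubuntu nodes; `patch_node` is an illustrative name, not something from the master class):

```bash
# Sketch of one node's OS-patch cycle; root SSH access is an assumption.
patch_node() {
  local node="$1"
  kubectl cordon "${node}"
  kubectl drain "${node}" --ignore-daemonsets --delete-emptydir-data
  # Patch and reboot the node over SSH.
  ssh "root@${node}" 'apt-get update && apt-get -y dist-upgrade && reboot'
  # Wait for the kubelet to report Ready again before putting pods back.
  until kubectl get node "${node}" --no-headers | grep -qw Ready; do
    sleep 10
  done
  kubectl uncordon "${node}"
}
```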
bulky-sunset-52084

09/14/2022, 5:54 PM
For the admin cluster, I'm wondering if replacing out-of-date nodes with new ones based on an up-to-date template is a good idea.
💯 that would be the most "cattle, not pets" way of doing it. Ephemeral nodes will make your life a million times easier than trying to manage a bunch of individual Linux nodes. SUSE also sells a product called SUMA (open source: Uyuni) if full-blown node replacement doesn't work for you... hope this helps 🙂
plain-refrigerator-80586

09/15/2022, 6:39 AM
Yes, it helps, but I still don't know how to handle this cattle approach. If we have an admin cluster with 3 nodes (each acting as worker, cp & etcd), how should we proceed? Adding 3 new nodes to the admin cluster and, after they are registered, removing the 3 old nodes? Or adding/removing 1 node at a time?
bulky-sunset-52084

09/15/2022, 10:23 PM
Both methods should work alright - mostly it's just down to preference. If you do it all at once, you might have a blip of downtime as the jobs will all need to be rescheduled. What is your tolerance for an outage?
happy-wire-88980

09/15/2022, 10:25 PM
So I normally recommend what we call node rehydration, which means you don't change nodes in place; you build new nodes and remove the old ones.
That is of course easier said than done, but it's one of those things where you put in the hard work today for an easier tomorrow.
But if replacing nodes is not an option, I have some scripts for doing rolling OS upgrades.
Note: I built this for my lab, so it does things like sleep for 900 seconds between nodes to avoid pods moving around like crazy during patching and to allow Longhorn to fully rebuild between nodes.
#!/bin/bash

# Print usage information.
help() {
  echo "Usage: $(basename "$0") -c <cluster name>"
}

while getopts "c:h" opt; do
  case $opt in
    c)
      CLUSTER="${OPTARG}"
      ;;
    h)
      help && exit 0
      ;;
    :)
      echo "Option -$OPTARG requires an argument."
      exit 1
      ;;
    *)
      help && exit 1
      ;;
  esac
done

if [[ -z "${CLUSTER}" ]]; then
  echo "Please specify a cluster name."
  exit 1
fi

export KUBECONFIG=~/.kube/${CLUSTER}
kubeconfig=~/.kube/${CLUSTER}
cd ~/scripts/rolling-patching/

check_ssh() {
  echo "Checking ${server}"
  until ssh -q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@${server} 'uptime' > /dev/null
  do
    echo "Trying again..."
    sleep 1
  done
}

echo "Starting patching..."
for server in $(kubectl get nodes -o name | awk -F '/' '{print $2}')
do
  if ping -c 1 "${server}"
  then
    echo "Server is pingable..."
    scp -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ./90forceyes root@${server}:/etc/apt/apt.conf.d/
    scp -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ./release-upgrades root@${server}:/etc/update-manager/release-upgrades
    echo "Draining node..."
    kubectl --kubeconfig ${kubeconfig} drain --delete-emptydir-data --ignore-daemonsets ${server}
    echo "Running apt update and upgrade"
    ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@${server} 'export DEBIAN_FRONTEND=noninteractive; apt-get update && apt-get -o Dpkg::Options::="--force-confold" -o Dpkg::Options::=--force-confdef upgrade -q -y --allow-downgrades --allow-remove-essential --allow-change-held-packages && apt-get -o Dpkg::Options::="--force-confold" -o Dpkg::Options::=--force-confdef dist-upgrade -q -y --allow-downgrades --allow-remove-essential --allow-change-held-packages && reboot'
    sleep 60
    check_ssh
    echo "Running do-release-upgrade"
    ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@${server} 'export DEBIAN_FRONTEND=noninteractive; do-release-upgrade -f DistUpgradeViewNonInteractive; reboot'
    sleep 450
    echo "Uncordon node..."
    kubectl --kubeconfig ${kubeconfig} uncordon ${server}
    echo "Sleeping..."
    sleep 900
  else
    echo "Skipping..."
  fi
done
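If the script above is saved as, say, rolling-patch.sh (the filename is my own, for illustration), a run looks like this; the -c argument must match a kubeconfig file under ~/.kube/:

```bash
# Hypothetical invocation: patches every node of the cluster whose
# kubeconfig lives at ~/.kube/prod, one node at a time.
./rolling-patch.sh -c prod
```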
plain-refrigerator-80586

09/16/2022, 7:00 AM
@happy-wire-88980 why is replacing nodes not an option in my case? Is it because I only have 3 nodes in my admin cluster? If that's the issue, what's the correct admin cluster setup for node rehydration? Since I'm working with the RKE Terraform provider and VM templates, I'm pretty flexible and can deploy any number of nodes and add them to my admin cluster.
happy-wire-88980

09/16/2022, 7:01 AM
So the process I normally follow for the local cluster (3 nodes, all roles all nodes) is to add 3 new nodes then remove the old nodes one at a time.
We don't want to remove all 3 old nodes at once because that can break etcd quorum (a majority of members must stay up) and lead to a split brain
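The quorum arithmetic behind that, as a quick sketch (the `quorum` helper is purely illustrative): etcd stays writable only while a majority of members, floor(n/2) + 1, are up.

```bash
# etcd quorum: a majority of members, floor(n/2) + 1, must stay up.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # -> 2: a 3-member cluster tolerates losing just 1 node
quorum 6   # -> 4: with 3 old + 3 new members you can lose 2,
           #       hence removing the old nodes one at a time
```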
plain-refrigerator-80586

09/16/2022, 7:29 AM
OK thanks @happy-wire-88980 and @bulky-sunset-52084 for your help, I'll try this approach.