# harvester
b
NAME                                   COMPLETIONS   DURATION   AGE
hvst-upgrade-mrs26-post-drain-har-01   1/1           64s        170m
hvst-upgrade-mrs26-post-drain-har-02   1/1           74s        91m
hvst-upgrade-mrs26-post-drain-har-04   0/1           46m        47m
(1.1.1 to 1.1.2 upgrade)
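For reference, per-node upgrade job status like the above can be pulled with something along these lines (a sketch; the harvester-system namespace comes from commands later in this thread, and the grep pattern is just the upgrade name prefix):
# list the upgrade's per-node jobs with their completions, duration and age
kubectl -n harvester-system get jobs | grep hvst-upgrade-mrs26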
Also trying to create a support bundle, but the download errors out, with this in the rke2-ingress logs:
[rke2-ingress-nginx-controller-2sb97] 2023/06/12 14:18:05 [error] 452#452: *2850162 upstream prematurely closed connection while sending to client, client: 10.52.3.0, server: _, request: "GET /v1/harvester/supportbundles/bundle-it7ia/download HTTP/2.0", upstream: "http://10.52.5.169:80/v1/harvester/supportbundles/bundle-it7ia/download"
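In case it helps with debugging, the bundle's state can also be checked on the CR itself (a sketch; the supportbundles.harvesterhci.io resource name and the harvester-system namespace are assumptions on my part):
# check whether the bundle ever reached a ready state, and any errors in its conditions
kubectl -n harvester-system get supportbundles.harvesterhci.io
kubectl -n harvester-system describe supportbundles.harvesterhci.io bundle-it7ia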
I wonder if this from the rke2-server journal on the affected node is related:
Jun 12 12:50:21 har-04 rke2[106121]: time="2023-06-12T12:50:21Z" level=info msg="Labels and annotations have been set successfully on node: har-04"
Jun 12 12:50:45 har-04 rke2[108737]: time="2023-06-12T12:50:45Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1c70cd2_b314_492c_9552_275313f6cc38.slice/cri-containerd-a9374d4010b19827f941dd9ca10b20322b75483c8627ec2c4af57b1d7f8e1ea6.scope: device or resource busy"
Jun 12 12:50:45 har-04 rke2[108737]: time="2023-06-12T12:50:45Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1c70cd2_b314_492c_9552_275313f6cc38.slice/cri-containerd-a9374d4010b19827f941dd9ca10b20322b75483c8627ec2c4af57b1d7f8e1ea6.scope: device or resource busy"
Jun 12 12:50:45 har-04 rke2[108737]: time="2023-06-12T12:50:45Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/perf_event/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1c70cd2_b314_492c_9552_275313f6cc38.slice/cri-containerd-a9374d4010b19827f941dd9ca10b20322b75483c8627ec2c4af57b1d7f8e1ea6.scope: device or resource busy"
The job's startTime: "2023-06-12T12:50:01Z"
I rebooted the node and it eventually came up with 1.1.1 (as seen on the console) after failing to boot 1.1.2 due to a missing active.img - it seems to have rejoined the cluster with the updated rke2 version but otherwise the 1.1.1 OS
This causes the next post-drain job to have issues with active.img:
ERRO[2023-06-12T14:43:14Z] Failed to move /run/initramfs/cos-state/cOS/active.img to /run/initramfs/cos-state/cOS/passive.img: exit status 1
1. Remounted /run/initramfs/cos-state rw
2. Copied /run/initramfs/cos-state/cOS/active.img over from a yet-to-be-upgraded node (roughly as sketched after the log output below)
3. Remounted /run/initramfs/cos-state ro
This got me a bit further:
INFO[2023-06-12T15:13:45Z] Moving /run/initramfs/cos-state/cOS/active.img to /run/initramfs/cos-state/cOS/passive.img
INFO[2023-06-12T15:13:46Z] Finished moving /run/initramfs/cos-state/cOS/active.img to /run/initramfs/cos-state/cOS/passive.img
INFO[2023-06-12T15:13:46Z] Moving /run/initramfs/cos-state/cOS/transition.img to /run/initramfs/cos-state/cOS/active.img
INFO[2023-06-12T15:13:46Z] Finished moving /run/initramfs/cos-state/cOS/transition.img to /run/initramfs/cos-state/cOS/active.img
INFO[2023-06-12T15:13:46Z] Applying 'after-upgrade' hook
INFO[2023-06-12T15:13:46Z] Running after-upgrade hook
INFO[2023-06-12T15:13:46Z] Upgrade completed
ERRO[2023-06-12T15:13:46Z] Failed mounting device /dev/sda3 with label COS_STATE
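For reference, the remount-and-copy steps above were roughly the following (a sketch rather than the exact commands; it assumes root SSH between the nodes, with <source-node> standing in for a node still on 1.1.1):
# on the affected node (har-04): make the cOS state partition writable
mount -o remount,rw /run/initramfs/cos-state
# pull active.img from a node that has not been upgraded yet
scp root@<source-node>:/run/initramfs/cos-state/cOS/active.img /run/initramfs/cos-state/cOS/active.img
# back to read-only
mount -o remount,ro /run/initramfs/cos-state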
Skipped past the COS_STATE mount failure by creating
/tmp/skip-retry-with-succeed
in the pod, since it seemed everything that needed to be done had already been done.
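One way to create that file from outside the node, for reference (a sketch; it assumes the post-drain pod's container is named apply, which matches the crictl command further down):
# pod name for har-04's post-drain job; adjust per node
kubectl -n harvester-system exec hvst-upgrade-mrs26-post-drain-har-04-ckl2t -c apply -- touch /tmp/skip-retry-with-succeed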
Then ran these commands on the host (with the pod name below being the pod I forced into success above):
HARVESTER_UPGRADE_POD_NAME=hvst-upgrade-mrs26-post-drain-har-04-ckl2t

cat > /tmp/upgrade-reboot.sh << EOF
#!/bin/bash -ex
HARVESTER_UPGRADE_POD_NAME=$HARVESTER_UPGRADE_POD_NAME

EOF

cat >> /tmp/upgrade-reboot.sh << 'EOF'
source /etc/bash.bashrc.local
pod_id=$(crictl pods --name $HARVESTER_UPGRADE_POD_NAME --namespace harvester-system -o json | jq -er '.items[0].id')

# get the `apply` container ID
container_id=$(crictl ps --pod $pod_id --name apply -o json -a | jq -er '.containers[0].id')
container_state=$(crictl inspect $container_id | jq -er '.status.state')

if [ "$container_state" = "CONTAINER_EXITED" ]; then
  container_exit_code=$(crictl inspect $container_id | jq -r '.status.exitCode')

  if [ "$container_exit_code" = "0" ]; then
    sleep 10

    # workaround for https://github.com/harvester/harvester/issues/2865
    # kubelet could start from old manifest first and generate a new manifest later.
    rm -f /var/lib/rancher/rke2/agent/pod-manifests/*

    reboot
    exit 0
  fi
fi

exit 1
EOF

chmod +x /tmp/upgrade-reboot.sh
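# transient unit below: Restart=always + RestartSec=10 re-run the script every 10 seconds;
# it exits 1 until the apply container has exited with code 0, at which point it reboots the node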

cat > /run/systemd/system/upgrade-reboot.service << 'EOF'
[Unit]
Description=Upgrade reboot
[Service]
Type=simple
ExecStart=/tmp/upgrade-reboot.sh
Restart=always
RestartSec=10
EOF

systemctl daemon-reload
systemctl start upgrade-reboot
This seems to have made the node reboot with the correct rke2 and kernel versions
However, shortly after (all three server nodes are now upgraded) an agent node stopped posting ready and is NotReady - I don’t see a pre-drain job on it
  nodeStatuses:
    har-01:
      state: Succeeded
    har-02:
      state: Succeeded
    har-03:
      state: Images preloaded
    har-04:
      state: Succeeded
    har-05:
      state: Images preloaded
    har-06:
      state: Images preloaded
  previousVersion: v1.1.1
  repoInfo: |
    release:
        harvester: v1.1.2
        harvesterChart: 1.1.2
        os: Harvester v1.1.2
        kubernetes: v1.24.11+rke2r1
        rancher: v2.6.11
        monitoringChart: 100.1.0+up19.0.3
        minUpgradableVersion: v1.1.0
the node in question being har-03
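(The status above is from the Upgrade CR; one way to pull it, as a sketch - the resource group and the upgrade name are guesses on my part, inferred from the job names:)
# dump the current upgrade's per-node status
kubectl -n harvester-system get upgrades.harvesterhci.io hvst-upgrade-mrs26 -o yaml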
har-06 also went unavailable briefly but recovered fairly quickly. har-03 is causing all hell to break loose re: rebuilding replicas in Longhorn.
All nodes are now back and on the same rke2-server version, but the non-server nodes do not appear to get their pre-drain jobs started - is there a way to force the upgrade controller to continue/start the upgrade job for a node?
Looking at the code (my Go reading comprehension is pretty poor), I think it's controlled by annotations on the machine-plan secrets in fleet-local?
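For anyone following along, those secrets can be found like this (a sketch; the exact per-machine secret name is a placeholder):
# each machine has a machine-plan secret in fleet-local
kubectl -n fleet-local get secrets | grep machine-plan
# look at a plan secret's drain-related annotations
kubectl -n fleet-local get secret <machine-name>-machine-plan -o yaml | grep rke.cattle.io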
g
anything specific on har-03 which is causing the replica rebuilds?
b
I couldn't find anything, really, @great-bear-19718 - all volumes are healthy now, but I'm left with three server nodes upgraded and three still missing the 1.1.2 upgrade
Is there a way to start node pre-drain jobs (and then post-drain jobs) manually?
I got the pre-drain job going by annotating the corresponding machine-plan secret in fleet-local with
rke.cattle.io/pre-drain: '{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
and then the post-drain job by annotating with
rke.cattle.io/post-drain: '{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
which brings the node to Succeeded state and reboots it to 1.1.2, as far as the upgrade CRD is concerned
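As a one-liner that was roughly (a sketch; <machine-name>-machine-plan is a placeholder for the actual secret name):
# apply the pre-drain hook annotation to kick off the pre-drain job
kubectl -n fleet-local annotate secret <machine-name>-machine-plan \
  rke.cattle.io/pre-drain='{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
and then the same again with rke.cattle.io/post-drain once the pre-drain job had finished.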
Watching my test cluster during its own upgrade, I see it sets these as well - not sure if they're needed / what they do:
rke.cattle.io/drain-done: '{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
rke.cattle.io/drain-options: '{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'