# harvester
b
NAME                                   COMPLETIONS   DURATION   AGE
hvst-upgrade-mrs26-post-drain-har-01   1/1           64s        170m
hvst-upgrade-mrs26-post-drain-har-02   1/1           74s        91m
hvst-upgrade-mrs26-post-drain-har-04   0/1           46m        47m
(1.1.1 to 1.1.2 upgrade)
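For reference, per-node upgrade job status like the above can be pulled with something along these lines (a sketch; the harvester-system namespace comes from commands later in this thread, and the grep pattern is just the upgrade name prefix):
# list the upgrade's per-node jobs with their completions, duration and age
kubectl -n harvester-system get jobs | grep hvst-upgrade-mrs26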
Also trying to create a support bundle, but the download errors out, with this in the rke2-ingress logs:
[rke2-ingress-nginx-controller-2sb97] 2023/06/12 14:18:05 [error] 452#452: *2850162 upstream prematurely closed connection while sending to client, client: 10.52.3.0, server: _, request: "GET /v1/harvester/supportbundles/bundle-it7ia/download HTTP/2.0", upstream: "http://10.52.5.169:80/v1/harvester/supportbundles/bundle-it7ia/download"
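In case it helps with debugging, the bundle's state can also be checked on the CR itself (a sketch; the supportbundles.harvesterhci.io resource name and the harvester-system namespace are assumptions on my part):
# check whether the bundle ever reached a ready state, and any errors in its conditions
kubectl -n harvester-system get supportbundles.harvesterhci.io
kubectl -n harvester-system describe supportbundles.harvesterhci.io bundle-it7ia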
I wonder if this from the rke2-server journal on the affected node is related:
Jun 12 12:50:21 har-04 rke2[106121]: time="2023-06-12T12:50:21Z" level=info msg="Labels and annotations have been set successfully on node: har-04"
Jun 12 12:50:45 har-04 rke2[108737]: time="2023-06-12T12:50:45Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1c70cd2_b314_492c_9552_275313f6cc38.slice/cri-containerd-a9374d4010b19827f941dd9ca10b20322b75483c8627ec2c4af57b1d7f8e1ea6.scope: device or resource busy"
Jun 12 12:50:45 har-04 rke2[108737]: time="2023-06-12T12:50:45Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/blkio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1c70cd2_b314_492c_9552_275313f6cc38.slice/cri-containerd-a9374d4010b19827f941dd9ca10b20322b75483c8627ec2c4af57b1d7f8e1ea6.scope: device or resource busy"
Jun 12 12:50:45 har-04 rke2[108737]: time="2023-06-12T12:50:45Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/perf_event/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1c70cd2_b314_492c_9552_275313f6cc38.slice/cri-containerd-a9374d4010b19827f941dd9ca10b20322b75483c8627ec2c4af57b1d7f8e1ea6.scope: device or resource busy"
The job's startTime: "2023-06-12T12:50:01Z"
I rebooted the node and it eventually came up with 1.1.1 (as seen on the console) after failing to boot 1.1.2 due to a missing active.img - it seems to have rejoined the cluster with the updated rke2 version but otherwise the 1.1.1 OS
This causes the next post-drain job to have issues with active.img:
ERRO[2023-06-12T14:43:14Z] Failed to move /run/initramfs/cos-state/cOS/active.img to /run/initramfs/cos-state/cOS/passive.img: exit status 1
1. Remounted /run/initramfs/cos-state rw
2. Copied /run/initramfs/cos-state/cOS/active.img over from a yet-to-be-upgraded node (roughly as sketched after the log output below)
3. Remounted /run/initramfs/cos-state ro
This got me a bit further:
INFO[2023-06-12T15:13:45Z] Moving /run/initramfs/cos-state/cOS/active.img to /run/initramfs/cos-state/cOS/passive.img
INFO[2023-06-12T15:13:46Z] Finished moving /run/initramfs/cos-state/cOS/active.img to /run/initramfs/cos-state/cOS/passive.img
INFO[2023-06-12T15:13:46Z] Moving /run/initramfs/cos-state/cOS/transition.img to /run/initramfs/cos-state/cOS/active.img
INFO[2023-06-12T15:13:46Z] Finished moving /run/initramfs/cos-state/cOS/transition.img to /run/initramfs/cos-state/cOS/active.img
INFO[2023-06-12T15:13:46Z] Applying 'after-upgrade' hook
INFO[2023-06-12T15:13:46Z] Running after-upgrade hook
INFO[2023-06-12T15:13:46Z] Upgrade completed
ERRO[2023-06-12T15:13:46Z] Failed mounting device /dev/sda3 with label COS_STATE
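For reference, the remount-and-copy steps above were roughly the following (a sketch rather than the exact commands; it assumes root SSH between the nodes, with <source-node> standing in for a node still on 1.1.1):
# on the affected node (har-04): make the cOS state partition writable
mount -o remount,rw /run/initramfs/cos-state
# pull active.img from a node that has not been upgraded yet
scp root@<source-node>:/run/initramfs/cos-state/cOS/active.img /run/initramfs/cos-state/cOS/active.img
# back to read-only
mount -o remount,ro /run/initramfs/cos-state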
Skipped past the COS_STATE mount failure by creating
/tmp/skip-retry-with-succeed
in the pod, since it seemed everything that needed to be done had already been done.
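One way to create that file from outside the node, for reference (a sketch; it assumes the post-drain pod's container is named apply, which matches the crictl command further down):
# pod name for har-04's post-drain job; adjust per node
kubectl -n harvester-system exec hvst-upgrade-mrs26-post-drain-har-04-ckl2t -c apply -- touch /tmp/skip-retry-with-succeed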
Then ran these commands on the host (with the pod name below being the pod I forced into success above):
HARVESTER_UPGRADE_POD_NAME=hvst-upgrade-mrs26-post-drain-har-04-ckl2t

cat > /tmp/upgrade-reboot.sh << EOF
#!/bin/bash -ex
HARVESTER_UPGRADE_POD_NAME=$HARVESTER_UPGRADE_POD_NAME

EOF

cat >> /tmp/upgrade-reboot.sh << 'EOF'
source /etc/bash.bashrc.local
pod_id=$(crictl pods --name $HARVESTER_UPGRADE_POD_NAME --namespace harvester-system -o json | jq -er '.items[0].id')

# get the `apply` container ID
container_id=$(crictl ps --pod $pod_id --name apply -o json -a | jq -er '.containers[0].id')
container_state=$(crictl inspect $container_id | jq -er '.status.state')

if [ "$container_state" = "CONTAINER_EXITED" ]; then
  container_exit_code=$(crictl inspect $container_id | jq -r '.status.exitCode')

  if [ "$container_exit_code" = "0" ]; then
    sleep 10

    # workaround for https://github.com/harvester/harvester/issues/2865
    # kubelet could start from old manifest first and generate a new manifest later.
    rm -f /var/lib/rancher/rke2/agent/pod-manifests/*

    reboot
    exit 0
  fi
fi

exit 1
EOF

chmod +x /tmp/upgrade-reboot.sh
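# transient unit below: Restart=always + RestartSec=10 re-run the script every 10 seconds;
# it exits 1 until the apply container has exited with code 0, at which point it reboots the node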

cat > /run/systemd/system/upgrade-reboot.service << 'EOF'
[Unit]
Description=Upgrade reboot
[Service]
Type=simple
ExecStart=/tmp/upgrade-reboot.sh
Restart=always
RestartSec=10
EOF

systemctl daemon-reload
systemctl start upgrade-reboot
This seems to have made the node reboot with the correct rke2 and kernel versions
However, shortly after (all three server nodes are now upgraded) an agent node stopped posting ready and is NotReady - I don’t see a pre-drain job on it
  nodeStatuses:
    har-01:
      state: Succeeded
    har-02:
      state: Succeeded
    har-03:
      state: Images preloaded
    har-04:
      state: Succeeded
    har-05:
      state: Images preloaded
    har-06:
      state: Images preloaded
  previousVersion: v1.1.1
  repoInfo: |
    release:
        harvester: v1.1.2
        harvesterChart: 1.1.2
        os: Harvester v1.1.2
        kubernetes: v1.24.11+rke2r1
        rancher: v2.6.11
        monitoringChart: 100.1.0+up19.0.3
        minUpgradableVersion: v1.1.0
the node in question being har-03
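(The status above is from the Upgrade CR; one way to pull it, as a sketch - the resource group and the upgrade name are guesses on my part, inferred from the job names:)
# dump the current upgrade's per-node status
kubectl -n harvester-system get upgrades.harvesterhci.io hvst-upgrade-mrs26 -o yaml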
har-06 also went unavailable briefly but recovered fairly quickly. har-03 is causing all hell to break loose re: rebuilding replicas in Longhorn.
All nodes are now back and on the same rke2-server version, but the non-server nodes do not appear to get their pre-drain jobs started - is there a way to force the upgrade controller to continue/start the upgrade job for a node?
Looking at the code (my Go reading comprehension is pretty poor), I think it's controlled by annotations on the machine-plan secrets in fleet-local?
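For anyone following along, those secrets can be found like this (a sketch; the exact per-machine secret name is a placeholder):
# each machine has a machine-plan secret in fleet-local
kubectl -n fleet-local get secrets | grep machine-plan
# look at a plan secret's drain-related annotations
kubectl -n fleet-local get secret <machine-name>-machine-plan -o yaml | grep rke.cattle.io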
g
anything specific on har-03 which is causing the replica rebuilds?
b
I couldn't find anything, really, @great-bear-19718 - all volumes are healthy now, but I'm left with three server nodes upgraded and three still missing the 1.1.2 upgrade
Is there a way to start node pre-drain jobs (and then post-drain jobs) manually?
I got the pre-drain job going by annotating the corresponding machine-plan secret in fleet-local with
rke.cattle.io/pre-drain: '{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
and then the post-drain job by annotating with
rke.cattle.io/post-drain: '{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
which brings the node to Succeeded state and reboots it to 1.1.2, as far as the upgrade CRD is concerned
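As a one-liner that was roughly (a sketch; <machine-name>-machine-plan is a placeholder for the actual secret name):
# apply the pre-drain hook annotation to kick off the pre-drain job
kubectl -n fleet-local annotate secret <machine-name>-machine-plan \
  rke.cattle.io/pre-drain='{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
and then the same again with rke.cattle.io/post-drain once the pre-drain job had finished.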
Watching my test cluster during its own upgrade, I see it sets these as well - not sure if they're needed / what they do:
rke.cattle.io/drain-done: '{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
rke.cattle.io/drain-options: '{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'