d

damp-crayon-64796

10/26/2022, 12:19 PM
Tried upgrading from 1.0.3 to 1.1.0 but it failed. I am seeing these log messages:
+ virtctl start upgrade-repo-hvst-upgrade-xpf52 -n harvester-system
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xpf52" not found
State is: system service upgrade succeeded but the first node failed. I also don't see the VM upgrade-repo-hvst-upgrade-xpf52 at all. Any idea?
a

ancient-pizza-13099

10/26/2022, 12:37 PM
@damp-crayon-64796 What are the resources of each of your NODEs (CPU cores, memory and storage)? If a node has fewer than 12 CPU cores, it may not be enough to handle the peak resource requirements of v1.0.3 plus the upgrade.
cc @prehistoric-balloon-31801 @red-king-19196 @ancient-pilot-51731
@damp-crayon-64796 could you help log an issue on GitHub https://github.com/harvester/harvester/issues , and attach the support-bundle file (https://docs.harvesterhci.io/v1.0/troubleshooting/harvester/#generate-a-support-bundle)? Thanks.
d

damp-crayon-64796

10/26/2022, 12:40 PM
2 CPUs with 10 cores each
harvester1:/usr/local # lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
a

ancient-pizza-13099

10/26/2022, 12:43 PM
uptime
in harvester1 please
d

damp-crayon-64796

10/26/2022, 12:44 PM
harvester1:/usr/local # uptime
 12:43:59  up 21 days  6:23,  1 user,  load average: 1.10, 1.42, 1.20
a

ancient-pizza-13099

10/26/2022, 12:48 PM
harvester1 is not rebooted yet.
In the log, there should be many occurrences of virtctl start upgrade-repo-hvst-upgrade-xpf52 -n harvester-system. Is the "not found" returned from the very beginning, or only after a certain time?
Maybe the pod upgrade-repo-hvst-upgrade-xpf52 was removed after enough retries.
d

damp-crayon-64796

10/26/2022, 12:51 PM
directly at the beginning the messages appeared
k get pod
NAME                                                     READY   STATUS    RESTARTS        AGE
harvester-859d59d7c4-9nlk8                               0/1     Pending   0               145m
harvester-859d59d7c4-csrdf                               1/1     Running   0               153m
harvester-859d59d7c4-dlw2q                               1/1     Running   0               153m
harvester-load-balancer-686957bdfc-pktm6                 1/1     Running   2 (21d ago)     35d
harvester-network-controller-22hnv                       1/1     Running   0               153m
harvester-network-controller-4h8xc                       1/1     Running   1 (143m ago)    153m
harvester-network-controller-cpskp                       1/1     Running   0               153m
harvester-network-controller-manager-7f56fd5d45-f2l45    1/1     Running   0               153m
harvester-network-controller-manager-7f56fd5d45-lq8vn    1/1     Running   0               145m
harvester-network-webhook-57f74f7568-lm4wr               1/1     Running   0               153m
harvester-node-disk-manager-5kl9p                        1/1     Running   0               153m
harvester-node-disk-manager-kdfkx                        1/1     Running   0               153m
harvester-node-disk-manager-kwt94                        1/1     Running   1 (143m ago)    153m
harvester-node-manager-ng4fm                             1/1     Running   0               153m
harvester-node-manager-q74g5                             1/1     Running   0               153m
harvester-node-manager-rhhxq                             1/1     Running   1 (143m ago)    153m
harvester-webhook-ff874d44-8z748                         1/1     Running   0               153m
harvester-webhook-ff874d44-mzn5t                         1/1     Running   0               153m
harvester-webhook-ff874d44-v589t                         0/1     Pending   0               145m
hvst-upgrade-xpf52-post-drain-harvester1-527kb           0/1     Error     0               143m
hvst-upgrade-xpf52-post-drain-harvester1-7qj7f           0/1     Error     0               141m
hvst-upgrade-xpf52-post-drain-harvester1-cht6v           0/1     Error     0               141m
hvst-upgrade-xpf52-post-drain-harvester1-czfbw           0/1     Error     0               141m
hvst-upgrade-xpf52-post-drain-harvester1-lgptq           1/1     Running   0               141m
hvst-upgrade-xpf52-post-drain-harvester1-tngv7           0/1     Error     0               141m
hvst-upgrade-xpf52-post-drain-harvester1-v4t89           0/1     Error     0               141m
hvst-upgrade-xpf52-post-drain-harvester1-w9hsk           0/1     Error     0               141m
kube-vip-4spln                                           1/1     Running   17 (143m ago)   72d
kube-vip-5g2w7                                           1/1     Running   8 (21d ago)     72d
kube-vip-cloud-provider-0                                1/1     Running   7 (21d ago)     71d
kube-vip-vsbhd                                           1/1     Running   11 (21d ago)    72d
virt-api-77cdfbf56f-x7hhf                                1/1     Running   0               146m
virt-api-77cdfbf56f-zvvsc                                1/1     Running   0               150m
virt-controller-657f55f68c-586kp                         1/1     Running   0               151m
virt-controller-657f55f68c-fbkdx                         1/1     Running   0               145m
virt-handler-g89tl                                       1/1     Running   1 (143m ago)    152m
virt-handler-nzvjj                                       1/1     Running   0               151m
virt-handler-x5mns                                       1/1     Running   0               152m
virt-operator-c6ff785d7-924gn                            1/1     Running   0               153m
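(For reference, a minimal way to pull the logs of the failed post-drain pods listed above, assuming kubectl access to the cluster; the pod names come from the listing itself.)
# print the logs of every post-drain pod of this upgrade
for p in $(kubectl get pods -n harvester-system -o name | grep hvst-upgrade-xpf52-post-drain); do
  echo "===== $p ====="
  kubectl logs -n harvester-system "$p"
done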
a

ancient-pizza-13099

10/26/2022, 12:52 PM
are there any error logs in hvst-upgrade-xpf52-post-drain-harvester1-lgptq ?
d

damp-crayon-64796

10/26/2022, 12:53 PM
+++ dirname /usr/local/bin/upgrade_node.sh
++ cd /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/lib.sh
++ UPGRADE_NAMESPACE=harvester-system
++ UPGRADE_REPO_URL=http://upgrade-repo-hvst-upgrade-xpf52.harvester-system/harvester-iso
++ UPGRADE_REPO_VM_NAME=upgrade-repo-hvst-upgrade-xpf52
++ UPGRADE_REPO_RELEASE_FILE=http://upgrade-repo-hvst-upgrade-xpf52.harvester-system/harvester-iso/harvester-release.yaml
++ UPGRADE_REPO_SQUASHFS_IMAGE=http://upgrade-repo-hvst-upgrade-xpf52.harvester-system/harvester-iso/rootfs.squashfs
++ UPGRADE_REPO_BUNDLE_ROOT=http://upgrade-repo-hvst-upgrade-xpf52.harvester-system/harvester-iso/bundle
++ UPGRADE_REPO_BUNDLE_METADATA=http://upgrade-repo-hvst-upgrade-xpf52.harvester-system/harvester-iso/bundle/metadata.yaml
++ CACHED_BUNDLE_METADATA=
++ HOST_DIR=/host
+ UPGRADE_TMP_DIR=/host/usr/local/upgrade_tmp
+ mkdir -p /host/usr/local/upgrade_tmp
+ case $1 in
+ command_post_drain
+ wait_repo
++ get_repo_vm_status
++ kubectl get virtualmachines.kubevirt.io upgrade-repo-hvst-upgrade-xpf52 -n harvester-system '-o=jsonpath={.status.printableStatus}'
Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xpf52" not found
+ [[ '' == \R\u\n\n\i\n\g ]]
+ echo 'Try to bring up the upgrade repo VM...'
+ virtctl start upgrade-repo-hvst-upgrade-xpf52 -n harvester-system
Try to bring up the upgrade repo VM...
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xpf52" not found
+ true
+ sleep 10
... and this will loop
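(For context, the trace above is the wait_repo step looping. A rough sketch of the retry loop it implies, reconstructed from the trace rather than taken from the Harvester source:)
# keep retrying until the upgrade-repo VM reports Running
while true; do
  status=$(kubectl get virtualmachines.kubevirt.io upgrade-repo-hvst-upgrade-xpf52 \
    -n harvester-system -o=jsonpath='{.status.printableStatus}')
  [[ "$status" == "Running" ]] && break
  echo "Try to bring up the upgrade repo VM..."
  virtctl start upgrade-repo-hvst-upgrade-xpf52 -n harvester-system || true
  sleep 10
done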
a

ancient-pizza-13099

10/26/2022, 12:55 PM
kubectl get vm -A
kubectl get vmi -A
d

damp-crayon-64796

10/26/2022, 12:56 PM
[skoch@mgmt-sk ~]$ kubectl get vm -A
NAMESPACE   NAME                                       AGE   STATUS    READY
default     hci-k3s-cluster-k3s-179bbfac-ln7jz         21d   Stopped   False
default     hci-k3s-cluster-k3s-179bbfac-nrkwz         21d   Stopped   False
default     hci-k3s-cluster-k3s-179bbfac-pd6bx         21d   Stopped   False
default     hci-rke2-cluster-control-258c3ef0-g5cmh    21d   Stopped   False
default     hci-rke2-cluster-control-258c3ef0-qmd6t    21d   Stopped   False
default     hci-rke2-cluster-control-258c3ef0-qqrc2    21d   Stopped   False
default     hci-rke2-cluster-worker-c76869e8-b9t9s     21d   Stopped   False
default     hci-rke2-cluster-worker-c76869e8-csrqj     21d   Stopped   False
default     hci-rke2-cluster-worker-c76869e8-kn8pp     21d   Stopped   False
default     hci-rke2-cluster-worker-c76869e8-tr24g     21d   Stopped   False
default     hci-rke2-cluster-worker-c76869e8-xfcbj     21d   Stopped   False
default     rocky9-installed                           72d   Stopped   False
default     test-efi                                   48d   Stopped   False
default     test-rocky8                                72d   Stopped   False
[skoch@mgmt-sk ~]$ kubectl get vmi -A
No resources found
a

ancient-pizza-13099

10/26/2022, 12:57 PM
the upgrade VM upgrade-repo-hvst-upgrade-xpf52 is totally gone, tricky
✔️ 1
d

damp-crayon-64796

10/26/2022, 12:57 PM
Support bundle .... too big for github
a

ancient-pizza-13099

10/26/2022, 1:04 PM
I will spend some time looking into the support bundle file.
The most important message is:
logs/harvester-system/harvester-859d59d7c4-dlw2q/apiserver.log:2022-10-26T10:29:18.422203593Z time="2022-10-26T10:29:18Z" level=info msg="Delete upgrade repo VM harvester-system/upgrade-repo-hvst-upgrade-xpf52"
The upgrade repo is deleted.
Seems the upgrade timed out, and the repo was finally actively deleted.
yamls/namespaced/harvester-system/harvesterhci.io/v1beta1/upgrades.yaml
    creationTimestamp: "2022-10-26T09:41:08Z"
..
    - lastUpdateTime: "2022-10-26T10:29:18Z"
      message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      status: "False"
      type: NodesUpgraded
@damp-crayon-64796 how about the hardware of your cluster NODEs (CPU cores, memory, storage (SSD, NVMe or HDD?))
status:
    conditions:
    - lastUpdateTime: "2022-10-26T10:29:18Z"
      status: "False"
      type: Completed
    - lastUpdateTime: "2022-10-26T09:46:47Z"
      status: "True"
      type: ImageReady
    - lastUpdateTime: "2022-10-26T09:49:32Z"
      status: "True"
      type: RepoReady
    - lastUpdateTime: "2022-10-26T10:13:28Z"
      status: "True"
      type: NodesPrepared
    - lastUpdateTime: "2022-10-26T10:23:42Z"
      status: "True"
      type: SystemServicesUpgraded
    - lastUpdateTime: "2022-10-26T10:29:18Z"
      message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      status: "False"
      type: NodesUpgraded
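(For reference, the same condition list can be read live from the cluster; a minimal sketch, assuming the Upgrade object is named hvst-upgrade-xpf52, matching the job names above:)
# list the Upgrade CR's conditions as TYPE / STATUS / MESSAGE
kubectl get upgrades.harvesterhci.io hvst-upgrade-xpf52 -n harvester-system \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'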
d

damp-crayon-64796

10/26/2022, 2:11 PM
First of all: many thanks for your investigations!! The upgrade job timed out, I agree, but I also did not see the VM during the earlier retries. So I think it was just never created/downloaded?? I suspect some firewall/network topic in our environment, because we do not allow all traffic to go outside. My hardware (it is for demo only) is 3 blades with 2x10 cores each, 256G RAM and fast FC storage attached. I don't think it is the HW... I saw in the status page I posted above "Download Upgrade Image = 0%" ... is this correct?
p

prehistoric-balloon-31801

10/26/2022, 2:12 PM
The VM and image are deleted after an upgrade finish (either succeed or fail).
a

ancient-pizza-13099

10/26/2022, 2:14 PM
@damp-crayon-64796 If the ISO download fails, the upgrade won't continue.
@prehistoric-balloon-31801, when the post-drain job tried to bring up the VM at 10:29, the repo VM had already been deleted due to "Job has reached the specified backoff limit"; the previous jobs are all in error.
And only at 10:23 was the virt-api POD restarted with the upgraded version.
p

prehistoric-balloon-31801

10/26/2022, 2:17 PM
yes, the controller is buggy and marks the upgrade as failed when the job fails the first time. (So the VM is gone)
@damp-crayon-64796 what's the disk size? I saw it's FC storage. Could you also check /var/log/containers and see if you can find the first error job:
sudo ls /var/log/containers | grep post-drain
d

damp-crayon-64796

10/26/2022, 2:27 PM
yes, it is FC storage, not really supported, I know. The disk is nearly 6TB in size for the VMs and 120G for the OS.
The /var/log/containers directory shows a log file hvst-upgrade-xpf52-post-drain-harvester1-lgptq_harvester-system_apply-a9f297ed721dff961079244c4b814e61600ac30152ad3bb78218d217869cbb89.log with the content I posted above, where it loops the messages:
2022-10-26T14:29:36.063803761Z stderr F Error from server (NotFound): virtualmachines.kubevirt.io "upgrade-repo-hvst-upgrade-xpf52" not found
2022-10-26T14:29:36.068506651Z stderr F + [[ '' == \R\u\n\n\i\n\g ]]
2022-10-26T14:29:36.068556416Z stderr F + echo 'Try to bring up the upgrade repo VM...'
2022-10-26T14:29:36.068567017Z stderr F + virtctl start upgrade-repo-hvst-upgrade-xpf52 -n harvester-system
2022-10-26T14:29:36.06851447Z stdout F Try to bring up the upgrade repo VM...
2022-10-26T14:29:36.123610906Z stderr F Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xpf52" not found
2022-10-26T14:29:36.125633106Z stderr F + true
2022-10-26T14:29:36.12567131Z stderr F + sleep 10
👌 1
p

prehistoric-balloon-31801

10/26/2022, 2:30 PM
Could you check the free space?
df -h
d

damp-crayon-64796

10/26/2022, 2:31 PM
harvester1:/usr/local # df -h|grep -v ^overlay
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           126G     0  126G   0% /dev/shm
tmpfs            51G   15M   51G   1% /run
tmpfs           4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/sdg3        15G  4.4G  9.7G  31% /run/initramfs/cos-state
/dev/loop0      3.0G  1.3G  1.6G  45% /
tmpfs            63G   12M   63G   1% /run/overlay
/dev/sdg2        58M  1.5M   53M   3% /oem
/dev/sdg5        95G   55G   36G  61% /usr/local
tmpfs           126G  4.0K  126G   1% /tmp
/dev/sdh        5.9T  272G  5.3T   5% /var/lib/harvester/defaultdisk
tmpfs           1.0G   12K  1.0G   1% /var/lib/kubelet/pods/77b23e02-d9b4-4d5d-9aad-0c638d3e6253/volumes/kubernetes.io~projected/kube-api-access-sllbw
tmpfs           252G   12K  252G   1% /var/lib/kubelet/pods/a9014a20-9f2a-45ff-981e-4a5c790cffec/volumes/kubernetes.io~projected/kube-api-access-qcnvx
tmpfs           252G   12K  252G   1% /var/lib/kubelet/pods/c00bcf3e-a35d-4035-8f1d-0cf7a6d32c95/volumes/kubernetes.io~projected/kube-api-access-wxpjw
...
p

prehistoric-balloon-31801

10/26/2022, 2:34 PM
I saw some OOM message in dmesg, not quite sure if that’s related.
a

ancient-pizza-13099

10/26/2022, 2:53 PM
The OOM looks to be related to ``rancher-logging-root-fluentd-0``, per the keywords in the kernel OOM message:
task_memcg=/kubepods/burstable/pod6e088e0d-9998-430f-b332-d8679b11d825
and ruby, and
kubelet.log:I1026 10:22:49.113438    2922 reconciler.go:225] "operationExecutor.VerifyControllerAttachedVolume started for volume \"app-config\" (UniqueName: \"kubernetes.io/secret/6e088e0d-9998-430f-b332-d8679b11d825-app-config\") pod \"rancher-logging-root-fluentd-0\" (UID: \"6e088e0d-9998-430f-b332-d8679b11d825\") "
I did not find a way the POD could go into Error in the following code.
@prehistoric-balloon-31801 failure of the first POD
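(For reference, a quick way to confirm which pod the kernel OOM killer hit, assuming the events are still in the kernel ring buffer or journal:)
# search kernel messages for OOM kill events and the memcg/pod they name
journalctl -k | grep -iE 'out of memory|oom|task_memcg'
# or, if the journal is not available:
dmesg | grep -iE 'out of memory|oom'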
p

prehistoric-balloon-31801

10/26/2022, 3:07 PM
I’m also checking this. Any suspicious thing?
a

ancient-pizza-13099

10/26/2022, 3:08 PM
But it does not report again after
No retries permitted until 2022-10-26 10:27:04.815090863 +0000 UTC m=+32.013847439 (durationBeforeRetry 4s)
at 2022-10-26 10:27:04.815090863.
That is the running POD; for the other failed ones: maybe the first POD was marked as failed by kubelet due to an occasional failure.
All others are after hvst-upgrade-xpf52-post-drain-harvester1-527kb, which is the first one that failed.
Pod hvst-upgrade-xpf52-post-drain-harvester1-527kb failed within 2 minutes, which caused the first job try-out failure.
status:
    conditions:
    - lastProbeTime: "null"
      lastTransitionTime: "2022-10-26T10:26:35Z"
      status: "True"
      type: Initialized
    - lastProbeTime: "null"
      lastTransitionTime: "2022-10-26T10:28:34Z"
      reason: PodFailed
      status: "False"
      type: Ready
    - lastProbeTime: "null"
      lastTransitionTime: "2022-10-26T10:28:34Z"
      reason: PodFailed
      status: "False"
      type: ContainersReady
    - lastProbeTime: "null"
      lastTransitionTime: "2022-10-26T10:26:35Z"
      status: "True"
      type: PodScheduled
p

prehistoric-balloon-31801

10/26/2022, 3:29 PM
@damp-crayon-64796 Could you list ls /var/lib/rancher/rke2/agent/images/ on harvester1? Thanks
d

damp-crayon-64796

10/26/2022, 3:29 PM
harvester1:~ # ls /var/lib/rancher/rke2/agent/images/
cloud-controller-manager-image.txt  etcd-image.txt  kube-apiserver-image.txt  kube-controller-manager-image.txt  kube-proxy-image.txt  kube-scheduler-image.txt
👍 1
p

prehistoric-balloon-31801

10/26/2022, 4:19 PM
@ancient-pizza-13099 With this output, we can confirm the clean_rke2_archives function is complete.
a

ancient-pizza-13099

10/26/2022, 6:52 PM
@prehistoric-balloon-31801 do you mean the first post-drain POD did run, and failed somewhere else?
/yamls/cluster/v1/nodes.yaml

harvester1

    taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      timeAdded: "2022-10-26T10:23:42Z"
    unschedulable: true
While the pod was running, there is a log of a network interruption.
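(For reference, the taint above is the standard cordon marker left by the node drain. A minimal sketch of how to inspect it and, once the failed upgrade has been dealt with, lift it again; the node name is taken from the yaml above:)
# show whether the node is still cordoned and which taints it carries
kubectl get node harvester1 -o jsonpath='{.spec.unschedulable}{"\n"}{.spec.taints}{"\n"}'
# re-enable scheduling only after the upgrade state has been sorted out
kubectl uncordon harvester1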
d

damp-crayon-64796

10/27/2022, 10:26 AM
my network configuration is very basic, just 2 NICs in an active/backup config for everything. The mgmt network is untagged and the VM networks use some tagged VLANs.
a

ancient-pizza-13099

10/27/2022, 10:27 AM
no worry, this flannel network may affect the internal communication between pods, and the post-drain pod may fail due to this interruption; we are checking.
d

damp-crayon-64796

10/27/2022, 10:54 AM
had a look into the log files at
harvester1:/var/log/pods/kube-system_rke2-canal-vqct4_5a7cdf6a-11b7-4539-b6b4-93d7541d57cb/kube-flannel:
here is an excerpt
2022-10-05T06:21:48.58830961Z stderr F I1005 06:21:48.588233       1 iptables.go:243] Adding iptables rule: ! -s 10.52.0.0/16 -d 10.52.0.0/16 -m comment --comment flanneld masq -j MASQUERADE --random-fully
2022-10-26T10:26:35.376490188Z stderr F I1026 10:26:35.376360       1 watch.go:39] context canceled, close receiver chan
2022-10-26T10:26:35.376521477Z stderr F I1026 10:26:35.376406       1 vxlan_network.go:75] evts chan closed
2022-10-26T10:26:35.376533744Z stderr F I1026 10:26:35.376449       1 main.go:438] shutdownHandler sent cancel signal...
2022-10-26T10:26:35.376620616Z stderr F W1026 10:26:35.376562       1 reflector.go:436] github.com/flannel-io/flannel/subnet/kube/kube.go:379: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2022-10-26T10:26:35.376648476Z stderr F I1026 10:26:35.376622       1 main.go:394] Exiting cleanly...
a

ancient-pizza-13099

10/27/2022, 11:04 AM
please try in your harvester1:
ls /var/log/pods/harvester-system_hvst-upgrade-
and check if hvst-upgrade-xpf52-post-drain-harvester1-527kb is there.
If it exists, please attach all the log files under it.
d

damp-crayon-64796

10/27/2022, 11:12 AM
harvester1:/var/log/pods # ls
cattle-logging-system_rancher-logging-kube-audit-fluentbit-fh8sm_17e69f4b-95ce-4a4e-b7db-4feb8863757d
cattle-logging-system_rancher-logging-rke2-journald-aggregator-w5g52_9b83dd80-78b0-4336-bf92-ab05460b63a9
cattle-logging-system_rancher-logging-root-fluentbit-plkl8_e837a608-82b4-4642-bdec-7dd0d3cc3106
cattle-monitoring-system_rancher-monitoring-prometheus-node-exporter-zclfn_5ee9d172-73f4-433d-9858-81160d8985d1
cattle-system_system-upgrade-controller-7b8d94c7f5-pss5c_813190f4-6aeb-4b89-aa2d-81a1931f2ddc
harvester-system_harvester-network-controller-4h8xc_6f212eb7-8f62-406c-b113-cc8cfb2fa568
harvester-system_harvester-node-disk-manager-kwt94_115d0c21-edf0-4447-bd41-a78a8345c1ba
harvester-system_harvester-node-manager-rhhxq_40182858-4341-49be-a310-978810512c0f
harvester-system_hvst-upgrade-xpf52-post-drain-harvester1-lgptq_87ece495-1d23-47e6-886f-493ed16bf7c5
harvester-system_kube-vip-4spln_c00bcf3e-a35d-4035-8f1d-0cf7a6d32c95
harvester-system_virt-handler-g89tl_10792870-90c3-4655-8c52-8d754ac78148
kube-system_cloud-controller-manager-harvester1_1a84611ed06607ed8a51e65d936a6ff0
kube-system_etcd-harvester1_e18aa5e5b83a5a3c56d78e4054612394
kube-system_harvester-whereabouts-hqrcz_86497db2-cc43-4ff2-9801-f193e063b713
kube-system_kube-apiserver-harvester1_4874a08227e8932676b83ca998a390f3
kube-system_kube-controller-manager-harvester1_57585a0305e4e46df816ebab263926f3
kube-system_kube-proxy-harvester1_ce051fc91f9e463593a1d45efa60be52
kube-system_kube-scheduler-harvester1_2495d4d1888db1561e78ccbc2ff8677c
kube-system_rke2-canal-vqct4_5a7cdf6a-11b7-4539-b6b4-93d7541d57cb
kube-system_rke2-ingress-nginx-controller-ftfrf_3b272fd8-3e35-415c-9dbe-f585a7664341
kube-system_rke2-multus-ds-dh54w_77b23e02-d9b4-4d5d-9aad-0c638d3e6253
longhorn-system_backing-image-manager-d7ad-1dd5_93309371-8572-49ac-83eb-5fe4c7ef8466
longhorn-system_engine-image-ei-a5371358-ln9kp_6649d1c4-4ee4-4ad5-a27f-496420038bc1
longhorn-system_longhorn-csi-plugin-rcp7s_ec1c39c5-1f76-4571-8b9f-1bbaf85ce7ef
longhorn-system_longhorn-loop-device-cleaner-l2dsb_a9014a20-9f2a-45ff-981e-4a5c790cffec
longhorn-system_longhorn-manager-2clvr_f11f95c1-aa23-42d3-b6cd-927fcefd2b1a
harvester1:/var/log/pods # cd *527*
-bash: cd: *527*: No such file or directory
harvester1:/var/log/pods # cd /var/log/pods/harvester-system_hvst-upgrade-xpf52-post-drain-harvester1-lgptq_87ece495-1d23-47e6-886f-493ed16bf7c5
harvester1:/var/log/pods/harvester-system_hvst-upgrade-xpf52-post-drain-harvester1-lgptq_87ece495-1d23-47e6-886f-493ed16bf7c5 # ls
apply
harvester1:/var/log/pods/harvester-system_hvst-upgrade-xpf52-post-drain-harvester1-lgptq_87ece495-1d23-47e6-886f-493ed16bf7c5 # cd apply/
harvester1:/var/log/pods/harvester-system_hvst-upgrade-xpf52-post-drain-harvester1-lgptq_87ece495-1d23-47e6-886f-493ed16bf7c5/apply # ls -l
total 7964
-rw-r----- 1 root root 8146988 Oct 27 11:07 0.log
harvester1:/var/log/pods/harvester-system_hvst-upgrade-xpf52-post-drain-harvester1-lgptq_87ece495-1d23-47e6-886f-493ed16bf7c5/apply #
and the log file ...
a

ancient-pizza-13099

10/27/2022, 11:25 AM
thanks, let me check it now
it is not from hvst-upgrade-xpf52-post-drain-harvester1-527kb, which was deleted by kubelet after a certain time.
it is from hvst-upgrade-xpf52-post-drain-harvester1-lgptq, which reports
Error starting VirtualMachine virtualmachine.kubevirt.io "upgrade-repo-hvst-upgrade-xpf52" not found
d

damp-crayon-64796

10/27/2022, 11:30 AM
yes, sorry, I should have mentioned it earlier, but the *527kb directory is not there
a

ancient-pizza-13099

10/27/2022, 12:13 PM
no worry, the pod *527kb did exist at some point but was deleted by kubelet after some time.
@damp-crayon-64796 could you help run the following in harvester1:
systemctl status upgrade-reboot
ls /tmp -alth && ls /tmp/upgrade-reboot.sh -alth
d

damp-crayon-64796

10/28/2022, 9:47 AM
sure here it is
harvester1:~ # systemctl status upgrade-reboot
Unit upgrade-reboot.service could not be found.
harvester1:~ # ls /tmp -alth   &&  ls /tmp/upgrade-reboot.sh -alth
total 7.8M
drwxrwxrwt  23 root root  500 Oct 28 09:46 .
-rw-rw-rw-   1 root root 7.8M Oct 27 11:09 0.log
drwx------   2 root root   40 Oct 26 10:29 cachepkgs2350779609
drwx------   2 root root   40 Oct 26 10:29 tmp.Nm8lk73Xg1
drwx------   2 root root   40 Oct 26 10:29 cachepkgs1058812872
drwx------   2 root root   40 Oct 26 10:29 tmp.2uPxVXnnpx
drwx------   2 root root   40 Oct 26 10:29 cachepkgs967314055
drwx------   2 root root   40 Oct 26 10:29 tmp.7kdhisOixl
drwx------   2 root root   40 Oct 26 10:28 cachepkgs3212023921
drwx------   2 root root   40 Oct 26 10:28 tmp.UqPedlw8GW
drwx------   2 root root   40 Oct 26 10:28 cachepkgs2819240199
drwx------   2 root root   40 Oct 26 10:28 tmp.rvjd0nWTRf
drwx------   2 root root   40 Oct 26 10:28 cachepkgs75555247
drwx------   2 root root   40 Oct 26 10:28 tmp.OHVPN6zY1D
drwx------   2 root root   40 Oct 26 10:28 cachepkgs2296860756
drwx------   2 root root   40 Oct 26 10:28 tmp.sSSPqEeQ9y
-rw-------   1 root root  635 Oct 26 10:15 tmp.Fuqfa2nmkW
drwx------   3 root root   60 Oct  5 06:21 systemd-private-a20bc825d3c141dcbde9d255416be891-systemd-logind.service-QRxVQg
drwx------   3 root root   60 Oct  5 06:20 systemd-private-a20bc825d3c141dcbde9d255416be891-systemd-timesyncd.service-w7uxPh
drwxrwxrwt   2 root root   40 Oct  5 06:20 .ICE-unix
drwxrwxrwt   2 root root   40 Oct  5 06:20 .Test-unix
drwxrwxrwt   2 root root   40 Oct  5 06:20 .X11-unix
drwxrwxrwt   2 root root   40 Oct  5 06:20 .XIM-unix
drwxrwxrwt   2 root root   40 Oct  5 06:20 .font-unix
drwxr-xr-x. 22 root root 4.0K Aug 14 17:55 ..
a

ancient-pizza-13099

10/28/2022, 9:49 AM
could you help ls -alt those tmp.* directories? They were created at 10.26 10:28 and 10:29, from tmp.sSSPqEeQ9y to cachepkgs2350779609.
d

damp-crayon-64796

10/28/2022, 9:52 AM
nothing in there:
harvester1:/tmp # ls -alt tmp.*
-rw------- 1 root root 635 Oct 26 10:15 tmp.Fuqfa2nmkW

tmp.Nm8lk73Xg1:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:51 ..
drwx------  2 root root  40 Oct 26 10:29 .

tmp.2uPxVXnnpx:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:51 ..
drwx------  2 root root  40 Oct 26 10:29 .

tmp.7kdhisOixl:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:51 ..
drwx------  2 root root  40 Oct 26 10:29 .

tmp.UqPedlw8GW:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:51 ..
drwx------  2 root root  40 Oct 26 10:28 .

tmp.rvjd0nWTRf:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:51 ..
drwx------  2 root root  40 Oct 26 10:28 .

tmp.OHVPN6zY1D:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:51 ..
drwx------  2 root root  40 Oct 26 10:28 .

tmp.sSSPqEeQ9y:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:51 ..
drwx------  2 root root  40 Oct 26 10:28 .
harvester1:/tmp #
a

ancient-pizza-13099

10/28/2022, 9:54 AM
ls -alth cache*
and: cat /etc/mtab
d

damp-crayon-64796

10/28/2022, 9:55 AM
harvester1:/tmp # ls -alth cache*
cachepkgs2350779609:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:55 ..
drwx------  2 root root  40 Oct 26 10:29 .

cachepkgs1058812872:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:55 ..
drwx------  2 root root  40 Oct 26 10:29 .

cachepkgs967314055:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:55 ..
drwx------  2 root root  40 Oct 26 10:29 .

cachepkgs3212023921:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:55 ..
drwx------  2 root root  40 Oct 26 10:28 .

cachepkgs2819240199:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:55 ..
drwx------  2 root root  40 Oct 26 10:28 .

cachepkgs75555247:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:55 ..
drwx------  2 root root  40 Oct 26 10:28 .

cachepkgs2296860756:
total 0
drwxrwxrwt 23 root root 500 Oct 28 09:55 ..
drwx------  2 root root  40 Oct 26 10:28 .
harvester1:/tmp #
cat mtab:
a

ancient-pizza-13099

10/28/2022, 10:00 AM
truncated, please upload it as a file, thanks
d

damp-crayon-64796

10/28/2022, 10:01 AM
i think it was a file ... at the bottom is a link "see it in full". Or what am I doing wrong?
a

ancient-pizza-13099

10/28/2022, 10:01 AM
between Oct 26 10:28 and 10:29, some shell command in the post-drain script returned 1 and caused the failure; we are checking which cmd.
OK, clicked and got the full content, thanks.
cat /etc/os-release
d

damp-crayon-64796

10/28/2022, 10:13 AM
harvester1:/tmp # cat /etc/os-release
NAME="SLE Micro"
VERSION="5.2"
VERSION_ID="5.2"
PRETTY_NAME="Harvester v1.0.3"
ID="sle-micro-rancher"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sle-micro-rancher:5.2"
VARIANT="Harvester"
VARIANT_ID="Harvester-20220802"
GRUB_ENTRY_NAME="Harvester v1.0.3"
a

ancient-pizza-13099

10/28/2022, 10:14 AM
ok, still v1.0.3
please:
ls -alth /usr/local/upgrade_tmp
d

damp-crayon-64796

10/28/2022, 10:15 AM
harvester1:/tmp # ls -alth /usr/local/upgrade_tmp/
total 3.4G
drwxr-xr-x 10 root root 4.0K Oct 26 10:29 ..
-rw-------  1 root root 492M Oct 26 10:29 tmp.9gJwps92DB
drwxr-xr-x  2 root root 4.0K Oct 26 10:29 .
-rw-------  1 root root 492M Oct 26 10:29 tmp.SG1NQ0bSVk
-rw-------  1 root root 492M Oct 26 10:29 tmp.W8Ou1bUAuE
-rw-------  1 root root 492M Oct 26 10:28 tmp.M4QyyFvaoR
-rw-------  1 root root 492M Oct 26 10:28 tmp.GlhwlNnHcS
-rw-------  1 root root 492M Oct 26 10:28 tmp.nLxl4leBp1
-rw-------  1 root root 492M Oct 26 10:28 tmp.cGa78ygm56
a

ancient-pizza-13099

10/28/2022, 10:20 AM
@prehistoric-balloon-31801 The failure happens between mount and rm: the tmp_rootfs_squashfs file is still present in /usr/local/upgrade_tmp, with a size of 492M.
Maybe umount $tmp_rootfs_mount failed with 1,
or rm -rf $tmp_rootfs_squashfs failed, but that seems not possible?
Or chroot $HOST_DIR elemental upgrade --directory ${tmp_rootfs_mount#"$HOST_DIR"} failed with 1.
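(For context, a minimal sketch of the code block being discussed here; it is quoted in full later in the thread. Assuming the post-drain script runs with set -e, which the "|| true" after virtctl in the earlier trace suggests, any one of these commands returning non-zero fails the whole job:)
# example squashfs path; the real one is a mktemp file under /host/usr/local/upgrade_tmp
tmp_rootfs_squashfs=/host/usr/local/upgrade_tmp/tmp.SG1NQ0bSVk
set -e                                   # first non-zero exit aborts the job
tmp_rootfs_mount=$(mktemp -d -p /host/tmp)
mount "$tmp_rootfs_squashfs" "$tmp_rootfs_mount"
chroot /host elemental upgrade --directory "${tmp_rootfs_mount#/host}"
umount "$tmp_rootfs_mount"
rm -rf "$tmp_rootfs_squashfs"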
d

damp-crayon-64796

10/28/2022, 10:25 AM
I also tried to do some research and am wondering if there shouldn't be a VlanConfig from the convert routine ... but I may be wrong
[skoch@mgmt-sk ~]$ k get VlanConfig
No resources found
a

ancient-pizza-13099

10/28/2022, 10:27 AM
or it failed with mount $tmp_rootfs_squashfs $tmp_rootfs_mount;
/etc/mtab has no record of those tmp.* mounts.
@damp-crayon-64796 your case seems not to be related to VlanConfig
✔️ 1
Or your tmpfs is full. Please run:
df -H | grep tmpfs
d

damp-crayon-64796

10/28/2022, 10:29 AM
harvester1:/tmp # df -H | grep tmpfs
devtmpfs        4.2M     0  4.2M   0% /dev
tmpfs           136G     0  136G   0% /dev/shm
tmpfs            55G   17M   55G   1% /run
tmpfs           4.2M     0  4.2M   0% /sys/fs/cgroup
tmpfs            68G   13M   68G   1% /run/overlay
tmpfs           136G  8.2M  136G   1% /tmp
tmpfs           1.1G   13k  1.1G   1% /var/lib/kubelet/pods/77b23e02-d9b4-4d5d-9aad-0c638d3e6253/volumes/kubernetes.io~projected/kube-api-access-sllbw
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/a9014a20-9f2a-45ff-981e-4a5c790cffec/volumes/kubernetes.io~projected/kube-api-access-qcnvx
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/c00bcf3e-a35d-4035-8f1d-0cf7a6d32c95/volumes/kubernetes.io~projected/kube-api-access-wxpjw
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/5a7cdf6a-11b7-4539-b6b4-93d7541d57cb/volumes/kubernetes.io~projected/kube-api-access-lht29
tmpfs           135M   13k  135M   1% /var/lib/kubelet/pods/40182858-4341-49be-a310-978810512c0f/volumes/kubernetes.io~projected/kube-api-access-58kdr
tmpfs           210M   13k  210M   1% /var/lib/kubelet/pods/86497db2-cc43-4ff2-9801-f193e063b713/volumes/kubernetes.io~projected/kube-api-access-dvrlb
tmpfs           135M   13k  135M   1% /var/lib/kubelet/pods/6f212eb7-8f62-406c-b113-cc8cfb2fa568/volumes/kubernetes.io~projected/kube-api-access-bl4pv
tmpfs           271G     0  271G   0% /var/lib/kubelet/pods/f11f95c1-aa23-42d3-b6cd-927fcefd2b1a/volumes/kubernetes.io~secret/longhorn-grpc-tls
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/f11f95c1-aa23-42d3-b6cd-927fcefd2b1a/volumes/kubernetes.io~projected/kube-api-access-hjfpw
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/115d0c21-edf0-4447-bd41-a78a8345c1ba/volumes/kubernetes.io~projected/kube-api-access-4jbh6
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/6649d1c4-4ee4-4ad5-a27f-496420038bc1/volumes/kubernetes.io~projected/kube-api-access-bhhj4
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/ec1c39c5-1f76-4571-8b9f-1bbaf85ce7ef/volumes/kubernetes.io~projected/kube-api-access-cdk7p
tmpfs           271G  8.2k  271G   1% /var/lib/kubelet/pods/10792870-90c3-4655-8c52-8d754ac78148/volumes/kubernetes.io~secret/kubevirt-virt-handler-server-certs
tmpfs           271G  8.2k  271G   1% /var/lib/kubelet/pods/10792870-90c3-4655-8c52-8d754ac78148/volumes/kubernetes.io~secret/kubevirt-virt-handler-certs
tmpfs           271G  4.1k  271G   1% /var/lib/kubelet/pods/10792870-90c3-4655-8c52-8d754ac78148/volumes/kubernetes.io~downward-api/podinfo
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/10792870-90c3-4655-8c52-8d754ac78148/volumes/kubernetes.io~projected/kube-api-access-zx4px
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/9b83dd80-78b0-4336-bf92-ab05460b63a9/volumes/kubernetes.io~projected/kube-api-access-q2t4t
tmpfs           210M  4.1k  210M   1% /var/lib/kubelet/pods/17e69f4b-95ce-4a4e-b7db-4feb8863757d/volumes/kubernetes.io~secret/config
tmpfs           210M  4.1k  210M   1% /var/lib/kubelet/pods/e837a608-82b4-4642-bdec-7dd0d3cc3106/volumes/kubernetes.io~secret/config
tmpfs           210M   13k  210M   1% /var/lib/kubelet/pods/17e69f4b-95ce-4a4e-b7db-4feb8863757d/volumes/kubernetes.io~projected/kube-api-access-xdlvm
tmpfs           210M   13k  210M   1% /var/lib/kubelet/pods/e837a608-82b4-4642-bdec-7dd0d3cc3106/volumes/kubernetes.io~projected/kube-api-access-wksts
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/813190f4-6aeb-4b89-aa2d-81a1931f2ddc/volumes/kubernetes.io~projected/kube-api-access-zdtlt
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/93309371-8572-49ac-83eb-5fe4c7ef8466/volumes/kubernetes.io~projected/kube-api-access-4tlcf
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/87ece495-1d23-47e6-886f-493ed16bf7c5/volumes/kubernetes.io~projected/kube-api-access-tkzcm
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/3b272fd8-3e35-415c-9dbe-f585a7664341/volumes/kubernetes.io~secret/webhook-cert
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/3b272fd8-3e35-415c-9dbe-f585a7664341/volumes/kubernetes.io~projected/kube-api-access-472cx
tmpfs           271G     0  271G   0% /var/lib/kubelet/pods/d0d8d47c-69c8-4b48-9702-232e92febcdb/volumes/kubernetes.io~secret/longhorn-grpc-tls
tmpfs           271G     0  271G   0% /var/lib/kubelet/pods/da4c88fc-5c77-4816-a7e4-dd793bf78a3c/volumes/kubernetes.io~secret/longhorn-grpc-tls
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/daec1f31-6c1d-4f9e-aa7b-89ac59a74b0d/volumes/kubernetes.io~projected/kube-api-access-gtmjk
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/bcd59922-8cfb-4ab6-880e-1567b47ac988/volumes/kubernetes.io~projected/kube-api-access-8cvln
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/37a10d03-68a8-42ac-a49c-93966d90ab92/volumes/kubernetes.io~projected/kube-api-access-lsm4q
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/d0d8d47c-69c8-4b48-9702-232e92febcdb/volumes/kubernetes.io~projected/kube-api-access-lhgzs
tmpfs           271G   13k  271G   1% /var/lib/kubelet/pods/da4c88fc-5c77-4816-a7e4-dd793bf78a3c/volumes/kubernetes.io~projected/kube-api-access-qk54g
tmpfs            28G     0   28G   0% /run/user/0
a

ancient-pizza-13099

10/28/2022, 10:31 AM
please try
rm -rf /usr/local/upgrade_tmp/tmp.9gJwps92DB
ls -alth /usr/local/upgrade_tmp/
d

damp-crayon-64796

10/28/2022, 10:31 AM
harvester1:/tmp #  rm -rf /usr/local/upgrade_tmp/tmp.9gJwps92DB
harvester1:/tmp # ls -alth /usr/local/upgrade_tmp/
total 2.9G
drwxr-xr-x  2 root root 4.0K Oct 28 10:31 .
drwxr-xr-x 10 root root 4.0K Oct 26 10:29 ..
-rw-------  1 root root 492M Oct 26 10:29 tmp.SG1NQ0bSVk
-rw-------  1 root root 492M Oct 26 10:29 tmp.W8Ou1bUAuE
-rw-------  1 root root 492M Oct 26 10:28 tmp.M4QyyFvaoR
-rw-------  1 root root 492M Oct 26 10:28 tmp.GlhwlNnHcS
-rw-------  1 root root 492M Oct 26 10:28 tmp.nLxl4leBp1
-rw-------  1 root root 492M Oct 26 10:28 tmp.cGa78ygm56
a

ancient-pizza-13099

10/28/2022, 10:32 AM
mkdir /tmp/tmprootfs_1
mount /usr/local/upgrade_tmp/tmp.SG1NQ0bSVk /tmp/tmprootfs_1
umount /tmp/tmprootfs_1
let's try to mount and umount
d

damp-crayon-64796

10/28/2022, 10:34 AM
I did this without any issue on the host. OR should I do it in the post_drain pod?
a

ancient-pizza-13099

10/28/2022, 10:36 AM
maybe chroot $HOST_DIR elemental upgrade --directory ${tmp_rootfs_mount#"$HOST_DIR"}, the true upgrade itself, left the fs in a state where it could not be umounted, or the rm failed.
Let me figure out how to start a new job to run those commands.
journalctl -k | grep mount
Check the kernel messages; we just did a manual mount.
In the support-bundle, it shows:
Oct 26 10:28:33 harvester1 kernel: EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: (null)
d

damp-crayon-64796

10/28/2022, 10:48 AM
Oct 05 06:20:56 harvester1 systemd[1]: sysroot-oem.mount: Succeeded.
Oct 05 06:20:56 harvester1 systemd[1]: sysroot-var.mount: Succeeded.
Oct 05 06:20:56 harvester1 systemd[1]: sysroot.mount: Succeeded.
Oct 05 06:20:58 harvester1 kernel: EXT4-fs (sdh): mounted filesystem with ordered data mode. Opts: (null)
Oct 05 06:25:59 harvester1 kernel: EXT4-fs (sdj): mounted filesystem with ordered data mode. Opts: (null)
Oct 26 10:28:33 harvester1 kernel: EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: (null)
Oct 26 10:28:40 harvester1 kernel: EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: (null)
Oct 26 10:28:47 harvester1 kernel: EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: (null)
Oct 26 10:28:54 harvester1 kernel: EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: (null)
Oct 26 10:29:01 harvester1 kernel: EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: (null)
Oct 26 10:29:08 harvester1 kernel: EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: (null)
Oct 26 10:29:15 harvester1 kernel: EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: (null)
a

ancient-pizza-13099

10/28/2022, 10:51 AM
df -H | grep sd
d

damp-crayon-64796

10/28/2022, 10:51 AM
harvester1:/tmp # df -H  | grep sd
/dev/sdg3        16G  4.7G   11G  31% /run/initramfs/cos-state
/dev/sdg2        61M  1.6M   55M   3% /oem
/dev/sdg5       102G   59G   39G  61% /usr/local
/dev/sdh        6.5T  346G  5.8T   6% /var/lib/harvester/defaultdisk
a

ancient-pizza-13099

10/28/2022, 10:54 AM
Then those
Oct 26 10:28:33 harvester1 kernel: EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: (null)
logs are caused by this line:
chroot $HOST_DIR elemental upgrade --directory ${tmp_rootfs_mount#"$HOST_DIR"}
p

prehistoric-balloon-31801

10/28/2022, 1:38 PM
@damp-crayon-64796 Could you check if they are all identical files? :
sha256sum /usr/local/upgrade_tmp/*
d

damp-crayon-64796

10/28/2022, 1:40 PM
I could check this later in the evening
p

prehistoric-balloon-31801

10/28/2022, 1:40 PM
@ancient-pizza-13099 If they are identical, it means the jobs were actually retrying the download of those squashfs image files, and we might have non-idempotent code (umount/mount) or elemental upgrade exiting 1. That would prove your suspicion.
It's no hurry, and thanks for being with us!
a

ancient-pizza-13099

10/28/2022, 2:22 PM
I logged https://github.com/harvester/harvester/issues/3070 to track this issue.
❤️ 1
please also post the full text of df -H, thanks
d

damp-crayon-64796

10/28/2022, 4:12 PM
the sha256sums are all the same
rancher@harvester1:/usr/local/upgrade_tmp> sudo sha256sum *
23130ebe608ae968cc1346c630fe5079148aba8e8420ccf82b559ef6a8b72b51  tmp.GlhwlNnHcS
23130ebe608ae968cc1346c630fe5079148aba8e8420ccf82b559ef6a8b72b51  tmp.M4QyyFvaoR
23130ebe608ae968cc1346c630fe5079148aba8e8420ccf82b559ef6a8b72b51  tmp.SG1NQ0bSVk
23130ebe608ae968cc1346c630fe5079148aba8e8420ccf82b559ef6a8b72b51  tmp.W8Ou1bUAuE
23130ebe608ae968cc1346c630fe5079148aba8e8420ccf82b559ef6a8b72b51  tmp.cGa78ygm56
23130ebe608ae968cc1346c630fe5079148aba8e8420ccf82b559ef6a8b72b51  tmp.nLxl4leBp1
df -h
rancher@harvester1:/usr/local/upgrade_tmp> df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           126G     0  126G   0% /dev/shm
tmpfs            51G   16M   51G   1% /run
tmpfs           4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/sdg3        15G  4.4G  9.7G  31% /run/initramfs/cos-state
/dev/loop0      3.0G  1.3G  1.6G  45% /
tmpfs            63G   12M   63G   1% /run/overlay
overlay          63G   12M   63G   1% /boot
overlay          63G   12M   63G   1% /etc
/dev/sdg2        58M  1.5M   53M   3% /oem
overlay          63G   12M   63G   1% /srv
/dev/sdg5        95G   55G   36G  61% /usr/local
overlay          63G   12M   63G   1% /var
tmpfs           126G  7.8M  126G   1% /tmp
/dev/sdh        5.9T  323G  5.3T   6% /var/lib/harvester/defaultdisk
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/b15f579314aa85d0fca77774c99544d1ce16663210246480686bafb6c46efb37/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/43d29b69c82c29c5e9f97053d9488e0b92e86fe9003a71d6fcf0b765ffce176f/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/e8402c94c6d3050431bed960802a740c9018ad58c630bcf80d7adf79ac4da00c/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/5108a0bec9afcfbd8300f247166aa14904df44cba18feee2b7c2e8dd3815a239/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/41d6b24c399d5d9b790f684d4e42226b225b61d90e4e95ec4e6440be2740f5ee/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/0ab5febe5d5d47e2ae647794d3c5f9c90f29b11f52e64c11df8267daac3ccaa5/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/e7646afb2cc8f7bc1ce8d85980eb0f711fb39a4876168b23e896604efe6877d8/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/c2e5a62b3cc7bb107b97a03b420c20e986036a8a73e005fb42c82f9bef82b1bd/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/d725dc2fcd0c2b142a0c05cd9785b1a95d5d6ebbf30f6413565f34cef3964616/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/1eb0d3f179445d3e91940f30bf0cce091e386daa20541bae73b1573ea615a520/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/60e4d4a83aa0aea2be75d23b24ea746a4eac8b178ccde87fe1fb2364b7faf310/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/a619f96c3a2f25a1ec0e7b8338fbf4819be3bd60d7fb6c884a3b3f61db0238da/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/727702241eba838d8025ce1007e68ff5a43504111486d0ea4848eaaa278d5d9e/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/2782eba80cec6102e140a10741e8414e49bc621f6cb8df8fe7a92d0dabc0bc9a/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/ccb164434563f8710c23f7e7a56e2436c26ff6c586bb206cdb3bf7218d58a480/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/e5385bcc3f459e1f2ab2ea6ca808d1f7a51c58cc236180534a5c73fc1e8eebc3/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/4ce8768cce6791a48183bffebf238efe50d51e931c1666d1b83f2135e4c4c69d/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/8cdd58778d8210f875e149e35284f61e3656f393257096dbb5d0a0e4f8e0984f/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/1febf6fb150c0b58492204e392d7efecfc4c659dba6772265f6965775ffd148c/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/ac38a6e848d1fdba34878508e9243af926ed04aee15ca45c5a4da3455d3e82a9/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/aa4161a4fd2d1fdfe531b8fd7a98bc3e538c59d25657c2faf091b10169079f27/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/aa44b0577e238dedc45556072133c7be31aad5e2c50a2e7bae0d84e31b681170/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/8ff3d1e46c98b96014ee73d0733f6494702fbcba2ce75b2c8a176e6662b5a4a6/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/aa07d70417dda4c3661446b31e1e522049194dbf38e88b37ef086e1493b1feb3/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/81687f56475268f15d7cb874f7e44d7b266368e94ef93836a1598a9f5ff373ee/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/228f6bdabb4bbeb79d7957a5da41facdbbf2c9cefb75d4828fd0e0b6d139e712/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/3077eef294e359afbd57246098a9cd0459f2f2ef4c0da9a7627fe23b5b9b0e2c/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/9c21db0deb8cfebe279f051c782bae946f8bf22b27fc4f02158d653aacedb656/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/d8e1a1e9d865de92a5301ccaee6aeeb95797ac824e3615f096174004f541687c/shm
shm              64M     0   64M   0% /run/k3s/containerd/io.containerd.grpc.v1.cri/sandboxes/6db76a72de018607fd9844ac2dadf8860a4aa5ebcc42b46f0dbc4cd8e49df70b/shm
rancher@harvester1:/usr/local/upgrade_tmp>
for the GitHub issue ... it is a 3-node cluster
[skoch@mgmt-sk ~]$ k get no -o wide
NAME         STATUS   ROLES                       AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE           KERNEL-VERSION                CONTAINER-RUNTIME
harvester1   Ready    control-plane,etcd,master   74d   v1.24.7+rke2r1    10.1.35.91    <none>        Harvester v1.0.3   5.3.18-150300.59.87-default   containerd://1.6.8-k3s1
harvester3   Ready    control-plane,etcd,master   74d   v1.22.12+rke2r1   10.1.35.93    <none>        Harvester v1.0.3   5.3.18-150300.59.87-default   containerd://1.5.13-k3s1
harvester4   Ready    control-plane,etcd,master   74d   v1.22.12+rke2r1   10.1.35.94    <none>        Harvester v1.0.3   5.3.18-150300.59.87-default   containerd://1.5.13-k3s1
i uncordoned the failed node
✔️ 1
saw your last comment in the issue, therefore:
harvester1:~ # lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0    7:0    0    3G  1 loop /
sda      8:0    1 29.9G  0 disk
sdb      8:16   0  120G  0 disk
├─sdb1   8:17   0   64M  0 part
├─sdb2   8:18   0   64M  0 part
├─sdb3   8:19   0   15G  0 part
├─sdb4   8:20   0    8G  0 part
└─sdb5   8:21   0 96.9G  0 part
sdc      8:32   0  5.9T  0 disk
sdd      8:48   0  120G  0 disk
├─sdd1   8:49   0   64M  0 part
├─sdd2   8:50   0   64M  0 part
├─sdd3   8:51   0   15G  0 part
├─sdd4   8:52   0    8G  0 part
└─sdd5   8:53   0 96.9G  0 part
sde      8:64   0  120G  0 disk
├─sde1   8:65   0   64M  0 part
├─sde2   8:66   0   64M  0 part
├─sde3   8:67   0   15G  0 part
├─sde4   8:68   0    8G  0 part
└─sde5   8:69   0 96.9G  0 part
sdf      8:80   0  5.9T  0 disk
sdg      8:96   0  120G  0 disk
├─sdg1   8:97   0   64M  0 part
├─sdg2   8:98   0   64M  0 part /oem
├─sdg3   8:99   0   15G  0 part /run/initramfs/cos-state
├─sdg4   8:100  0    8G  0 part
└─sdg5   8:101  0 96.9G  0 part /usr/local
sdh      8:112  0  5.9T  0 disk /var/lib/harvester/defaultdisk
sdi      8:128  0  5.9T  0 disk
sdk      8:160  0   40G  0 disk
└─sdk1   8:161  0   40G  0 part
what is a bit tricky is that we have FC storage and the disks are seen multiple times; multipath isn't configured on the OS, so you see each disk 4 times
harvester1:~ # lsblk| grep ^sd | grep 5.9
sdc      8:32   0  5.9T  0 disk
sdf      8:80   0  5.9T  0 disk
sdh      8:112  0  5.9T  0 disk /var/lib/harvester/defaultdisk
sdi      8:128  0  5.9T  0 disk
harvester1:~ # lsblk| grep ^sd | grep 120
sdb      8:16   0  120G  0 disk
sdd      8:48   0  120G  0 disk
sde      8:64   0  120G  0 disk
sdg      8:96   0  120G  0 disk
a

ancient-pizza-13099

10/28/2022, 6:08 PM
thanks. the disks are tricky, we are checking if this will cause elemental upgrade to fail.
lsblk -o NAME,LABEL,PARTLABEL
please
d

damp-crayon-64796

10/28/2022, 6:40 PM
harvester1:~ # lsblk -o NAME,LABEL,PARTLABEL
NAME   LABEL           PARTLABEL
loop0  COS_ACTIVE
sda
sdb
├─sdb1 COS_GRUB        p.grub
├─sdb2 COS_OEM         p.oem
├─sdb3 COS_STATE       p.state
├─sdb4 COS_RECOVERY    p.recovery
└─sdb5 COS_PERSISTENT  p.persistent
sdc    HARV_LH_DEFAULT
sdd
├─sdd1 COS_GRUB        p.grub
├─sdd2 COS_OEM         p.oem
├─sdd3 COS_STATE       p.state
├─sdd4 COS_RECOVERY    p.recovery
└─sdd5 COS_PERSISTENT  p.persistent
sde
├─sde1 COS_GRUB        p.grub
├─sde2 COS_OEM         p.oem
├─sde3 COS_STATE       p.state
├─sde4 COS_RECOVERY    p.recovery
└─sde5 COS_PERSISTENT  p.persistent
sdf    HARV_LH_DEFAULT
sdg
├─sdg1 COS_GRUB        p.grub
├─sdg2 COS_OEM         p.oem
├─sdg3 COS_STATE       p.state
├─sdg4 COS_RECOVERY    p.recovery
└─sdg5 COS_PERSISTENT  p.persistent
sdh    HARV_LH_DEFAULT
sdi    HARV_LH_DEFAULT
sdk
└─sdk1
👍 1
a

ancient-pizza-13099

10/28/2022, 6:55 PM
@prehistoric-balloon-31801 Is it possible for us to manually start a job that runs only those few lines of shell code, to trigger the elemental upgrade with those existing tmp files?
tmp_rootfs_mount=$(mktemp -d -p $HOST_DIR/tmp)
  mount $tmp_rootfs_squashfs $tmp_rootfs_mount

  chroot $HOST_DIR elemental upgrade --directory ${tmp_rootfs_mount#"$HOST_DIR"}
  umount $tmp_rootfs_mount
  rm -rf $tmp_rootfs_squashfs

  umount -R /run
d

damp-crayon-64796

10/28/2022, 7:28 PM
if we set
export rootfs_squashfs=/host/usr/local/upgrade_tmp/tmp.nLxl4leBp1
and
HOST_DIR=/host
I think we could execute these commands in the post_drain pod which is still running. Do you think this would work?
a

ancient-pizza-13099

10/28/2022, 7:34 PM
The current post-drain POD will be blocked waiting for the repo VM.
We need to hack the pod to start directly from the desired code block.
We will do some tests to make sure it works, and then try it in your harvester
🙂
p

prehistoric-balloon-31801

10/31/2022, 2:05 AM
We can test that command (by backing up the current os image):
# backup
cp /run/initramfs/cos-state/cOS/active.img /usr/local/active.img.bak

# upgrade
mkdir /tmp/new_root
mount /usr/local/upgrade_tmp/tmp.GlhwlNnHcS /tmp/new_root
elemental upgrade --directory /tmp/new_root
I added another disk with the exact same layout to a running system and did the upgrade, and it indeed causes issues. The elemental command tries to mount the first COS_STATE partition.
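(For reference, a quick way to see why the duplicated labels are ambiguous here: with the same LUN visible on four paths, a lookup by filesystem label can resolve to any of the sdb/sdd/sde/sdg partitions. A minimal check, assuming blkid is available on the host:)
# every path to the same disk advertises the same labels, so a by-label lookup is ambiguous
blkid -t LABEL=COS_STATE -o device
lsblk -o NAME,LABEL | grep COS_STATE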
d

damp-crayon-64796

11/01/2022, 12:54 PM
thanks a lot. So what I could do is disable 3 of the 4 paths to the disk and then run the above commands. Not sure if I would need a reboot, but perhaps it will work. If this works then I just need to know how to proceed with nodes 2+3. What do you think?
p

prehistoric-balloon-31801

11/02/2022, 9:19 AM
Hi Stephan, do you know why there are multiple disks with identical partitions? In fact, this might not be a good idea because the booted system is uncertain, it could run into a wrong system or even mount a wrong persistent partition.
d

damp-crayon-64796

11/02/2022, 9:20 AM
these are not multiple disks, it is one disk which is seen through different paths
p

prehistoric-balloon-31801

11/02/2022, 9:21 AM
Got it, thanks. Is it possible to disable other paths? Harvester can’t support multipath disks at this moment.
d

damp-crayon-64796

11/02/2022, 9:22 AM
yes, that's what I suggested above
p

prehistoric-balloon-31801

11/02/2022, 9:22 AM
(We’ll work on a fix to choose the right partition, but it’s still better not to have so many paths)
d

damp-crayon-64796

11/02/2022, 9:24 AM
yes, understood; if/when you want to support fibre channel disks I think you must allow use of the multipath daemon and use /dev/mapper volumes instead of /dev/sd devices.
Do you think I should disable the paths and then try the elemental upgrade as you suggested above?
p

prehistoric-balloon-31801

11/02/2022, 9:29 AM
You can disable the paths and restart the upgrade again (it can’t be “resumed” in the middle). I’ll write a brief procedure for how to do that.
✔️ 1
d

damp-crayon-64796

11/02/2022, 10:03 AM
I am at step 2 and unsure: you want me to delete the red-marked lines, right? And not in the annotations? And what is meant by the post-drain hooks?
p

prehistoric-balloon-31801

11/02/2022, 10:04 AM
I think your configuration looks good and you don’t need to do anything. What’s the cluster state?
kubectl get clusters.cluster.x-k8s.io local -n fleet-local
d

damp-crayon-64796

11/02/2022, 10:05 AM
[skoch@mgmt-sk ~]$ kubectl get clusters.cluster.x-k8s.io local -n fleet-local
NAME    PHASE          AGE   VERSION
local   Provisioning   79d
p

prehistoric-balloon-31801

11/02/2022, 10:07 AM
Could you run the “./drain-status.sh” script, it can help determine current state.
You can skip “2. Edit cluster and remove pre-drain and post-drain hooks.“, I updated the gist too.
d

damp-crayon-64796

11/02/2022, 10:18 AM
[skoch@mgmt-sk harv_up]$ . ./drain-status.sh

harvester1 (custom-6a3a7673cfe4)
  rke-pre-drain: {"IgnoreErrors":false,"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"postDrainHooks":[{"annotation":"<http://harvesterhci.io/post-hook|harvesterhci.io/post-hook>"}],"preDrainHooks":[{"annotation":"<http://harvesterhci.io/pre-hook|harvesterhci.io/pre-hook>"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}
  harvester-pre-hook {"IgnoreErrors":false,"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"postDrainHooks":[{"annotation":"<http://harvesterhci.io/post-hook|harvesterhci.io/post-hook>"}],"preDrainHooks":[{"annotation":"<http://harvesterhci.io/pre-hook|harvesterhci.io/pre-hook>"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}
  rke-post-drain: {"IgnoreErrors":false,"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"postDrainHooks":[{"annotation":"<http://harvesterhci.io/post-hook|harvesterhci.io/post-hook>"}],"preDrainHooks":[{"annotation":"<http://harvesterhci.io/pre-hook|harvesterhci.io/pre-hook>"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}
  harvester-post-hook: null

harvester3 (custom-7a17ad8fa75f)
  rke-pre-drain: null
  harvester-pre-hook null
  rke-post-drain: null
  harvester-post-hook: null

harvester4 (custom-ef4fd4a88161)
  rke-pre-drain: null
  harvester-pre-hook null
  rke-post-drain: null
  harvester-post-hook: null
p

prehistoric-balloon-31801

11/02/2022, 10:18 AM
You can just do
./post-drain.sh harvester1
and the cluster should be back to “Provisioned” later
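(For reference, after running the hook script the cluster phase can simply be watched until it flips back; a minimal sketch using the same object queried above:)
# watch the provisioning cluster object until PHASE returns to Provisioned
kubectl get clusters.cluster.x-k8s.io local -n fleet-local -w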
d

damp-crayon-64796

11/02/2022, 10:21 AM
[skoch@mgmt-sk harv_up]$ . ./post-drain.sh harvester1
harvester.cattle.io/post-hook: '{"IgnoreErrors":false,"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
secret/custom-6a3a7673cfe4-machine-plan annotated
[skoch@mgmt-sk harv_up]$ kubectl get clusters.cluster.x-k8s.io local -n fleet-local
NAME    PHASE          AGE   VERSION
local   Provisioning   79d
It took some time
[skoch@mgmt-sk harv_up]$ kubectl get clusters.cluster.x-k8s.io local -n fleet-local
NAME    PHASE         AGE   VERSION
local   Provisioned   79d
Just in the GUI the state is still failed
should I now follow the procedure at "#start-over-an-upgrade"?
p

prehistoric-balloon-31801

11/02/2022, 10:42 AM
Exactly
d

damp-crayon-64796

11/02/2022, 10:46 AM
okay, did this, and the Upgrade button in the GUI has disappeared now. I think since the cluster is already on 1.1.0 the upgrade will not be offered anymore (?)
p

prehistoric-balloon-31801

11/02/2022, 10:52 AM
d

damp-crayon-64796

11/02/2022, 10:57 AM
unfortunately it does not appear
[skoch@mgmt-sk harv_up]$ k apply -f https://releases.rancher.com/harvester/v1.1.0/version.yaml
version.harvesterhci.io/v1.1.0 created
[skoch@mgmt-sk harv_up]$ k get version.harvesterhci.io/v1.1.0
NAME     ISO-URL                                                                    RELEASEDATE   MINUPGRADABLEVERSION
v1.1.0   https://releases.rancher.com/harvester/v1.1.0/harvester-v1.1.0-amd64.iso   20221025
sorry, I looked at the wrong place. I see it now
the first host succeeded now, but the second lost its IP during reboot. The node is up and I think upgraded, but it has no network
I had to copy and adjust these files from the first node: ifcfg-mgmt-bo and ifcfg-mgmt-br, and did an ifup mgmt-bo; the files were missing entirely. Then the upgrade succeeded. Could you help me make these files permanent please?
p

prehistoric-balloon-31801

11/02/2022, 2:12 PM
I think you might hit this: https://github.com/harvester/harvester/issues/3045 Do you have bonding with multiple NICs?
d

damp-crayon-64796

11/02/2022, 2:55 PM
yes, indeed, this fixed the issue. THANKS so much to both of you for helping here and for figuring out that the issue was on my side, trying a configuration which isn't supposed to work. Great experience to work with you.
👍 1
💯 1