# harvester
f
Hi, I'm upgrading my prod cluster (3-node cluster, HP DL G9) from 1.4.1 to 1.4.2. The upgrade process gets stuck either here or at the image preloading step on node 1, and the only container failing is this one. Any idea?
image.png
b
Did the pre-check script pass all the tests?
👍 1
What do the logs and events from the crash looping pod say?
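Something like this would show it (a rough sketch - the namespace and pod name are placeholders, adjust for wherever the failing pod lives):
# cluster-wide events, newest last
kubectl get events -A --sort-by=.lastTimestamp | tail -n 30
# describe the crash-looping pod and pull its previous logs
kubectl -n <namespace> describe pod <pod-name>
kubectl -n <namespace> logs <pod-name> --previous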
f
yes
I fixed the error by removing the other logging cluster outputs and the Flow.
But now it's stuck and it's not creating the jobs in the cattle-system namespace,
and it's stuck in this state.
It's been 2 years now running Harvester in prod, and each upgrade is becoming a nightmare.
👀 1
r
Hey @future-gigabyte-33261, could you generate a support bundle for us to assess the situation?
f
Yes
b
I normally start rebooting the nodes that are stuck, after verifying they are drained of active VMs. I feel like it's 50/50 whether it will fix things.
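Roughly what I check first (a sketch - Harvester VMs are KubeVirt VirtualMachineInstances, and the node name here is just an example):
# any VMs still scheduled on the node? (check the NODE column)
kubectl get vmi -A -o wide | grep harvester-node01
# cordon/drain it if anything is left, then reboot
kubectl cordon harvester-node01
kubectl drain harvester-node01 --ignore-daemonsets --delete-emptydir-data
ssh rancher@<node-ip> "sudo reboot"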
f
I will try tomorrow and also generate the support bundle, since I can't access the infra today.
p
always be ready to nuke the cluster is my motto 😅
f
supportbundle_c902b5ab-8c69-4eda-82e4-68728d1b2e9e_2025-05-27T16-20-02Z.zip
👀 1
@red-king-19196
f
Hello @red-king-19196! @future-gigabyte-33261 and I tried to investigate a bit on our own, but basically we only restarted the nodes a couple of times. We noticed that the etcd database size was not matching across the nodes, and that there was a sudden increase in size at least in one of the nodes when the upgrade started (a jump from 77MB to 1.6GB). Besides this, the rke2-server service is failing to start on all nodes, throwing 502 and 503 errors.
Ideally, we'd prefer to continue with the upgrade to 1.4.2 (and then to 1.4.3). However, if that's not possible in the current state, we'd rather roll back to the state before the upgrade and start again from there.
b
etcd may just need to get defragged
Are you monitoring events?
ie
watch kubectl get events -A
f
I can't use kubectl.
b
O.o
why not?
f
RKE2 service issues.
1 sec
image.png
b
Uh. That's not good
p
3ogp2b.jpg
f
image.png
b
Looking at the harvester UI, are you on a node it currently thinks is the control plane?
f
I have lost the UI.
The cluster is degraded.
This is node 1, and it's supposed to be the main node.
b
I'd log into all the other nodes and see if you have kubectl from any of them.
f
let me do that
1 sec
OK, I have it on node 3.
The weird issue is that it comes up
and then it disappears.
image.png
b
Network failures
f
And now it's gone again.
b
start checking
systemctl
for the server processes and make sure they're running properly and/or journald
sounds like rke2 is crashing
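e.g. something like this on each node (a sketch - the unit is rke2-server on control-plane nodes and rke2-agent on agent-only nodes):
# is the service up or crash-looping?
systemctl status rke2-server
# follow its logs
journalctl -u rke2-server -f
# on an agent-only node
systemctl status rke2-agent
journalctl -u rke2-agent -f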
f
yep
image.png
@bland-article-62755 node1 is the master and it's 10.10.100.1; this is node3.
Meanwhile, on node 2, the rke2-server service is starting.
b
I've only recovered etcd from split brain once and we needed our Suse support contract to do it.
🦜 1
It sounds like you might have rebooted the etcd nodes before they could recover and it's now in a bad state.
f
And this has always confused me. When we set this cluster up, we did the following: 1. set up node 1, create the cluster; 2. set up node 2, join it into the cluster; 3. set up node 3, join it into the cluster.
b
Yeah, after node2 and node3 join, they get promoted so you have HA.
f
Yeah, but node 2 is running rke2-server,
and node 3 is running the rke2-agent
service.
b
but if you turn them all off at the same time before etcd is healthy bad things™️ happen
f
And all the VMs were off,
and the node was cordoned.
I restarted it.
b
"but basically we only restarted the nodes a couple of times."
f
That's what I did also.
Nothing else.
b
that's nodes plural
f
for ip in 10.10.100.10 10.10.100.20 10.10.100.30; do ssh rancher@$ip "sudo su -c 'reboot'"; done
At first I did only node one,
then it stayed in the same state,
then
I rebooted all.
b
Well at least we know what's wrong
f
I'm still not understanding what's wrong.
🙇‍♂️ 1
b
you turned off all the nodes at the same time and etcd is failing.
Do you have etcd backups?
https://harvesterhci.io/kb/ is also going to be helpful
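RKE2 also keeps local etcd snapshots on the server nodes by default - worth checking what's there (a sketch, using the default path):
ls -lh /var/lib/rancher/rke2/server/db/snapshots/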
f
in external storage no
but we have the snapshots
b
worst case you can do a
--cluster-reset
then restore one of the snapshots
scp the snapshot somewhere safe asap - would be my advice.
👍 1
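Roughly like this (a sketch from memory - double-check against the RKE2/Harvester docs before running anything; the backup host and snapshot file name are placeholders):
# get the local etcd snapshots off the node first
scp /var/lib/rancher/rke2/server/db/snapshots/* user@backup-host:/backups/
# worst case, on ONE server node: stop rke2 and reset the cluster from a snapshot
systemctl stop rke2-server
rke2 server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-file>
systemctl start rke2-server
# the other server nodes then need their old etcd data cleared before they rejoin (see the docs)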
kubectl works on node3 (for a little while) right?
f
From time to time, yes.
Even on node 1.
But after 2-3 minutes it goes away.
It's more stable on node 3, though.
b
does etcdctl work when it's up?
f
Didn't try, let me test it.
harvester-node01:/var/lib/rancher/rke2/server/db # etcdctl
bash: etcdctl: command not found
b
should be something like
kubectl -n kube-system exec -it $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name | head -1) -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint status --cluster -w table
👍 1
The binary for etcdctl is in a pod
f
let me test it
image.png
b
Hey!
You have a leader
That's actually really good
But you also have no space
which is bad.
f
What is the error about node 3?
b
You need to defrag
I'm guessing etcd is crashlooping
f
yep
b
$(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name | head -1)
uh, hang on
👍 1
The command for defragging is in that script.
❤️ 1
also check
df -h
on the nodes and see if there's anything you can do to clear up more space.
If it's needed
f
it has free space
image.png
b
I'd also check the releases and see if it actually updated harvester or not
cat /etc/os-release
f
Nope, the OS is not updated, but the Rancher image is.
The OS is still 1.4.1.
Let me test your script.
b
Ok, well first get etcd into a healthy state
f
I'm just waiting for kubectl to come up.
harvester-node01:~ # ./brian-script.sh
Getting etcd Status
{"level":"warn","ts":"2025-05-29T15:15:35.007994Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc0007141e0/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.10.100.30:2379: connect: connection refused\""}
Failed to get the status of endpoint <https://10.10.100.30:2379> (context deadline exceeded)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |             ERRORS             |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| <https://10.10.100.10:2379> | 4df3314745170102 |  3.5.16 |  2.1 GB |     false |      false |       620 |  124220055 |          124220055 |   memberID:5616887342432715010 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| <https://10.10.100.20:2379> | d76c47a87e7992ad |  3.5.16 |  2.1 GB |      true |      false |       620 |  124220182 |          124220182 |   memberID:5616887342432715010 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
command terminated with exit code 1
Defragging the etcd in the current cluster via etcd-harvester-node01
{"level":"warn","ts":"2025-05-29T15:15:40.196025Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00031c000/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[<https://10.10.100.10:2379>] (context deadline exceeded)
{"level":"warn","ts":"2025-05-29T15:15:45.197574Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00031c000/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.10.100.30:2379: connect: connection refused\""}
Failed to defragment etcd member[<https://10.10.100.30:2379>] (context deadline exceeded)
{"level":"warn","ts":"2025-05-29T15:15:50.202868Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00031c000/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[<https://10.10.100.20:2379>] (context deadline exceeded)
command terminated with exit code 1
Getting etcd Health
{"level":"warn","ts":"2025-05-29T15:15:55.375067Z","logger":"client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00026e000/10.10.100.20:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2025-05-29T15:15:55.375035Z","logger":"client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00027a000/10.10.100.30:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.10.100.30:2379: connect: connection refused\""}
+---------------------------+--------+--------------+---------------------------+
|         ENDPOINT          | HEALTH |     TOOK     |           ERROR           |
+---------------------------+--------+--------------+---------------------------+
| <https://10.10.100.10:2379> |  false |   6.953365ms | Active Alarm(s): NOSPACE  |
| <https://10.10.100.20:2379> |  false | 5.001969326s | context deadline exceeded |
| <https://10.10.100.30:2379> |  false | 5.001875971s | context deadline exceeded |
+---------------------------+--------+--------------+---------------------------+
Error: unhealthy cluster
command terminated with exit code 1
Getting etcd Status
{"level":"warn","ts":"2025-05-29T15:16:00.572917Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc0004fe1e0/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.10.100.30:2379: connect: connection refused\""}
Failed to get the status of endpoint <https://10.10.100.30:2379> (context deadline exceeded)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |             ERRORS             |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| <https://10.10.100.10:2379> | 4df3314745170102 |  3.5.16 |  2.1 GB |     false |      false |       620 |  124220498 |          124220498 |   memberID:5616887342432715010 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| <https://10.10.100.20:2379> | d76c47a87e7992ad |  3.5.16 |  2.1 GB |      true |      false |       620 |  124220669 |          124220669 |   memberID:5616887342432715010 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
command terminated with exit code 1
b
Can you just have it do the leader?
ah looks like no
f
I don't think I can.
b
Oh, the no-space alarm is probably for node3.
check the disk on node3?
f
image.png
b
what's the output of
kubectl -n kube-system get pod -l component=etcd
f
Wait a bit, it's gone again.
b
You can defrag just one node by the way:
kubectl -n kube-system exec -it etcd-harvester-node02 -- etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt defrag
👍 1
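If the defrag goes through, the NOSPACE alarm usually has to be disarmed explicitly afterwards - something like this (a sketch, reusing the same TLS flags):
kubectl -n kube-system exec -it etcd-harvester-node02 -- etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt alarm disarm
Then re-run the endpoint status check to confirm the alarm is gone.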
f
image.png
b
kubectl get nodes
?
f
image.png
this is the thing that confuses me
b
OK, here's the thing. I would have reached out and paid SUSE for support a long time ago, but I'm guessing that's not an option for you? That being said, I think that at this point it might be safe to stop the upgrade. https://docs.harvesterhci.io/v1.5/upgrade/troubleshooting/#stop-the-ongoing-upgrade
❤️ 1
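If I'm reading that page right, it basically comes down to deleting the Upgrade CR - roughly (a sketch; <upgrade-name> is whatever the first command returns, double-check the doc first):
# find the ongoing upgrade
kubectl -n harvester-system get upgrades
# delete it to stop the upgrade
kubectl -n harvester-system delete upgrade <upgrade-name>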
I'd cancel the upgrade and try to get the cluster into a healthy state before trying again.
It's not good that there's no etcd scheduled on node03
What I'd probably try (you may/probably want a second opinion here) is this - rough kubectl commands for the node steps are below:
• Delete the upgrade.
• Uncordon node01.
• Reboot node03 and see if etcd comes back.
• If it doesn't come back, cordon node03.
• Boot all my VMs and make sure they're running on nodes 1 & 2.
• Delete node03 from the cluster.
• Re-install node03 from the 1.4.1 ISO.
• Verify etcd is happy/healthy.
• Take a week or two off and plan the upgrade after 1.4.3 comes out (you might be able to jump right to it).
❤️ 1
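Something like this for the node steps (a sketch - node names assumed from your cluster, and Harvester normally has its own host-removal flow in the UI, so double-check before running anything):
kubectl uncordon harvester-node01
# if etcd doesn't come back on node03 after the reboot, keep it out of rotation
kubectl cordon harvester-node03
# once the VMs are confirmed running on nodes 1 & 2, remove node03
kubectl drain harvester-node03 --ignore-daemonsets --delete-emptydir-data
kubectl delete node harvester-node03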
f
Thank you so much for all your support we truly appreciate it. As a small startup, we’re currently limited in resources and unfortunately don’t have the capacity to manage paid support at the moment. Right now, our main goal is to get our services back online by restoring the cluster to version 1.4.1 and ensuring everything is up and running again. Thanks again for your help
thanks again @bland-article-62755
❤️ 1
image.png
Can't delete the upgrade anymore.
r
Well, sorry for the late reply, but after checking the support bundle you provided, it seems that your cluster is in bad shape.
Did you ever try to upgrade RKE2 without using the Harvester upgrade? Something like the steps in this doc?
I saw there are two Plan CRs:
rke2-master-plan
and
rke2-worker-plan
you can check it out with the command
kubectl -n cattle-system get plans
These are not something created by the Harvester upgrade procedure, and the creation timestamps suggest they were created a long time ago.
Besides that, the image-preloading Plan CR looks pretty weird; I've never seen this before:
$ kubectl -n cattle-system get plans hvst-upgrade-grcnc-prepare -o yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  annotations:
    sim.harvesterhci.io/creationTimestamp: "2025-05-25T12:00:39Z"
  creationTimestamp: "2025-05-25T12:00:39Z"
  generation: 1
  labels:
    harvesterhci.io/upgrade: hvst-upgrade-grcnc
    harvesterhci.io/upgradeComponent: node
  name: hvst-upgrade-grcnc-prepare
  namespace: cattle-system
  resourceVersion: "1910"
  uid: 20f139cc-1370-41f2-92b2-e3454dead2a7
spec:
  concurrency: 1
  jobActiveDeadlineSecs: 3600
  nodeSelector:
    matchExpressions:
    - key: upgrade.cattle.io/disable
      operator: Exists
    - key: upgrade.cattle.io/disable
      operator: Exists
    - key: upgrade.cattle.io/disable
      operator: Exists
    - key: upgrade.cattle.io/disable
      operator: Exists
    ... the list goes on and on ...
Also, the Upgrade CR does not look right. The nodeStatuses field only contains the third node:
❯ kubectl -n harvester-system get upgrades hvst-upgrade-grcnc -o yaml
apiVersion: harvesterhci.io/v1beta1
kind: Upgrade
metadata:
  annotations:
    harvesterhci.io/auto-cleanup-system-generated-snapshot: "true"
    harvesterhci.io/replica-replenishment-wait-interval: "600"
    sim.harvesterhci.io/creationTimestamp: "2025-05-25T11:22:06Z"
  creationTimestamp: "2025-05-25T11:22:06Z"
  finalizers:
  - wrangler.cattle.io/harvester-upgrade-controller
  generateName: hvst-upgrade-
  generation: 1
  labels:
    harvesterhci.io/latestUpgrade: "true"
    harvesterhci.io/upgradeState: UpgradingNodes
  name: hvst-upgrade-grcnc
  namespace: harvester-system
  resourceVersion: "2387"
  uid: 4edce737-efd1-4000-aa75-5066137c1e0d
spec:
  logEnabled: true
  version: v1.4.2
status:
  conditions:
  - status: Unknown
    type: Completed
  - status: "True"
    type: LogReady
  - status: "True"
    type: ImageReady
  - status: "True"
    type: RepoReady
  - lastUpdateTime: "2025-05-25T22:13:13Z"
    status: "True"
    type: NodesPrepared
  - status: "True"
    type: SystemServicesUpgraded
  - status: Unknown
    type: NodesUpgraded
  imageID: harvester-system/hvst-upgrade-grcnc
  nodeStatuses:
    harvester-node03:
      state: Images preloading
  previousVersion: v1.4.1
  repoInfo: |
    release:
      harvester: v1.4.2
      harvesterChart: 1.4.2
      os: Harvester v1.4.2
      kubernetes: v1.31.4+rke2r1
      rancher: v2.10.1
      monitoringChart: 103.1.1+up45.31.1
      minUpgradableVersion: v1.4.1
  upgradeLog: hvst-upgrade-grcnc-upgradelog
The upgrade process was stuck because of this. But the root cause is hard to pin down, because there are many abnormal things going on at once.
f
we nuked the cluster
💥 1
🦜 1