# harvester
f
Hi, I'm upgrading my prod cluster (3-node cluster, HP DL G9) from 1.4.1 to 1.4.2. The upgrade process gets stuck either here or at the image preloading step on node 1, and the only container failing is this one. Any idea?
image.png
b
Did the pre-check script pass all the tests?
👍 1
What do the logs and events from the crash looping pod say?
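Something like this would show it (a rough sketch - the namespace and pod name are placeholders, adjust for wherever the failing pod lives):
# cluster-wide events, newest last
kubectl get events -A --sort-by=.lastTimestamp | tail -n 30
# describe the crash-looping pod and pull its previous logs
kubectl -n <namespace> describe pod <pod-name>
kubectl -n <namespace> logs <pod-name> --previous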
f
yes
I fixed the error by removing the other logging cluster outputs and the Flow.
But now it's stuck and it's not creating the jobs in the cattle-system namespace,
and it's stuck in this state.
It's been 2 years now running Harvester in prod, and each upgrade is becoming a nightmare.
👀 1
r
Hey @future-gigabyte-33261, could you generate a support bundle for us to assess the situation?
f
Yes
b
I normally start rebooting the nodes that are stuck, after verifying they are drained of active VMs. I feel like it's 50/50 whether it will fix things.
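Roughly what I check first (a sketch - Harvester VMs are KubeVirt VirtualMachineInstances, and the node name here is just an example):
# any VMs still scheduled on the node? (check the NODE column)
kubectl get vmi -A -o wide | grep harvester-node01
# cordon/drain it if anything is left, then reboot
kubectl cordon harvester-node01
kubectl drain harvester-node01 --ignore-daemonsets --delete-emptydir-data
ssh rancher@<node-ip> "sudo reboot"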
f
I will try tomorrow and also generate the support bundle, since I can't access the infra today.
p
always be ready to nuke the cluster is my motto 😅
f
supportbundle_c902b5ab-8c69-4eda-82e4-68728d1b2e9e_2025-05-27T16-20-02Z.zip
👀 1
@red-king-19196
f
Hello @red-king-19196! @future-gigabyte-33261 and I tried to investigate a bit on our own, but basically we only restarted the nodes a couple of times. We noticed that the etcd database size was not matching across the nodes, and that there was a sudden increase in size at least in one of the nodes when the upgrade started (a jump from 77MB to 1.6GB). Besides this, the rke2-server service is failing to start on all nodes, throwing 502 and 503 errors.
Ideally, we'd prefer to continue with the upgrade to 1.4.2 (and then to 1.4.3). However, if that's not possible in the current state, we'd rather roll back to the state before the upgrade and start again from there.
b
etcd may just need to get defragged
Are you monitoring events?
ie
watch kubectl get events -A
f
I can't use kubectl.
b
O.o
why not?
f
RKE2 service issues.
1 sec
image.png
b
Uh. That's not good
p
3ogp2b.jpg
f
image.png
b
Looking at the harvester UI, are you on a node it currently thinks is the control plane?
f
I have lost the UI.
The cluster is degraded.
This is node 1, and it's supposed to be the main node.
b
I'd log into all the other nodes and see if you have kubectl from any of them.
f
let me do that
1 sec
OK, I have it on node 3.
The weird issue is that it comes up
and then it disappears.
image.png
b
Network failures
f
And now it's gone again.
b
start checking
systemctl
for the server processes and make sure they're running properly and/or journald
sounds like rke2 is crashing
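e.g. something like this on each node (a sketch - the unit is rke2-server on control-plane nodes and rke2-agent on agent-only nodes):
# is the service up or crash-looping?
systemctl status rke2-server
# follow its logs
journalctl -u rke2-server -f
# on an agent-only node
systemctl status rke2-agent
journalctl -u rke2-agent -f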
f
yep
image.png
@bland-article-62755 node1 is the master and it's 10.10.100.1; this is node3.
Meanwhile, on node 2, the rke2-server service is starting.
b
I've only recovered etcd from split brain once and we needed our Suse support contract to do it.
🦜 1
It sounds like you might have rebooted the etcd nodes before they could recover and it's now in a bad state.
f
And this has always confused me. When we set this cluster up, we did the following: 1. set up node 1, create the cluster; 2. set up node 2, join it into the cluster; 3. set up node 3, join it into the cluster.
b
Yeah, after node2 and node3 join, they get promoted so you have HA.
f
Yeah, but node 2 is running rke2-server,
and node 3 is running the rke2-agent
service.
b
but if you turn them all off at the same time before etcd is healthy bad things™️ happen
f
And all the VMs were off,
and the node was cordoned.
I restarted it.
b
"but basically we only restarted the nodes a couple of times."
f
That's what I did also.
Nothing else.
b
that's nodes plural
f
for ip in 10.10.100.10 10.10.100.20 10.10.100.30; do ssh rancher@$ip "sudo su -c 'reboot'"; done
At first I did only node one,
then it stayed in the same state,
then
I rebooted all.
b
Well at least we know what's wrong
f
I'm still not understanding what's wrong.
🙇‍♂️ 1
b
you turned off all the nodes at the same time and etcd is failing.
Do you have etcd backups?
https://harvesterhci.io/kb/ is also going to be helpful
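RKE2 also keeps local etcd snapshots on the server nodes by default - worth checking what's there (a sketch, using the default path):
ls -lh /var/lib/rancher/rke2/server/db/snapshots/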
f
in external storage no
but we have the snapshots
b
worst case you can do a
--cluster-reset
then restore one of the snapshots
scp the snapshot somewhere safe asap - would be my advice.
👍 1
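Roughly like this (a sketch from memory - double-check against the RKE2/Harvester docs before running anything; the backup host and snapshot file name are placeholders):
# get the local etcd snapshots off the node first
scp /var/lib/rancher/rke2/server/db/snapshots/* user@backup-host:/backups/
# worst case, on ONE server node: stop rke2 and reset the cluster from a snapshot
systemctl stop rke2-server
rke2 server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-file>
systemctl start rke2-server
# the other server nodes then need their old etcd data cleared before they rejoin (see the docs)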
kubectl works on node3 (for a little while) right?
f
From time to time, yes.
Even on node 1.
But after 2-3 minutes it goes away.
It's more stable on node 3, though.
b
does etcdctl work when it's up?
f
Didn't try, let me test it.
harvester-node01:/var/lib/rancher/rke2/server/db # etcdctl
bash: etcdctl: command not found
b
should be something like
kubectl -n kube-system exec -it $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name | head -1) -- etcdctl --endpoints 127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint status --cluster -w table
👍 1
The binary for etcdctl is in a pod
f
let me test it
image.png
b
Hey!
You have a leader
That's actually really good
But you also have no space
which is bad.
f
What is the error about node 3?
b
You need to defrag
I'm guessing etcd is crashlooping
f
yep
b
$(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name | head -1)
uh, hang on
👍 1
The command for defragging is in that script.
❤️ 1
also check
df -h
on the nodes and see if there's anything you can do to clear up more space.
If it's needed
f
it has free space
image.png
b
I'd also check the releases and see if it actually updated harvester or not
cat /etc/os-release
f
Nope, the OS is not updated, but the Rancher image is.
The OS is still 1.4.1.
Let me test your script.
b
Ok, well first get etcd into a healthy state
f
I'm just waiting for kubectl to come up.
harvester-node01:~ # ./brian-script.sh
Getting etcd Status
{"level":"warn","ts":"2025-05-29T15:15:35.007994Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc0007141e0/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.10.100.30:2379: connect: connection refused\""}
Failed to get the status of endpoint <https://10.10.100.30:2379> (context deadline exceeded)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |             ERRORS             |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| <https://10.10.100.10:2379> | 4df3314745170102 |  3.5.16 |  2.1 GB |     false |      false |       620 |  124220055 |          124220055 |   memberID:5616887342432715010 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| <https://10.10.100.20:2379> | d76c47a87e7992ad |  3.5.16 |  2.1 GB |      true |      false |       620 |  124220182 |          124220182 |   memberID:5616887342432715010 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
command terminated with exit code 1
Defragging the etcd in the current cluster via etcd-harvester-node01
{"level":"warn","ts":"2025-05-29T15:15:40.196025Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00031c000/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[<https://10.10.100.10:2379>] (context deadline exceeded)
{"level":"warn","ts":"2025-05-29T15:15:45.197574Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00031c000/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.10.100.30:2379: connect: connection refused\""}
Failed to defragment etcd member[<https://10.10.100.30:2379>] (context deadline exceeded)
{"level":"warn","ts":"2025-05-29T15:15:50.202868Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00031c000/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[<https://10.10.100.20:2379>] (context deadline exceeded)
command terminated with exit code 1
Getting etcd Health
{"level":"warn","ts":"2025-05-29T15:15:55.375067Z","logger":"client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00026e000/10.10.100.20:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2025-05-29T15:15:55.375035Z","logger":"client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc00027a000/10.10.100.30:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.10.100.30:2379: connect: connection refused\""}
+---------------------------+--------+--------------+---------------------------+
|         ENDPOINT          | HEALTH |     TOOK     |           ERROR           |
+---------------------------+--------+--------------+---------------------------+
| <https://10.10.100.10:2379> |  false |   6.953365ms | Active Alarm(s): NOSPACE  |
| <https://10.10.100.20:2379> |  false | 5.001969326s | context deadline exceeded |
| <https://10.10.100.30:2379> |  false | 5.001875971s | context deadline exceeded |
+---------------------------+--------+--------------+---------------------------+
Error: unhealthy cluster
command terminated with exit code 1
Getting etcd Status
{"level":"warn","ts":"2025-05-29T15:16:00.572917Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc0004fe1e0/127.0.0.1:2379>","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.10.100.30:2379: connect: connection refused\""}
Failed to get the status of endpoint <https://10.10.100.30:2379> (context deadline exceeded)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |             ERRORS             |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| <https://10.10.100.10:2379> | 4df3314745170102 |  3.5.16 |  2.1 GB |     false |      false |       620 |  124220498 |          124220498 |   memberID:5616887342432715010 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| <https://10.10.100.20:2379> | d76c47a87e7992ad |  3.5.16 |  2.1 GB |      true |      false |       620 |  124220669 |          124220669 |   memberID:5616887342432715010 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
command terminated with exit code 1
b
Can you just have it do the leader?
ah looks like no
f
I don't think I can.
b
Oh, the no-space alarm is probably for node3.
check the disk on node3?
f
image.png
b
what's the output of
kubectl -n kube-system get pod -l component=etcd
f
Wait a bit, it's gone again.
b
You can defrag just one node by the way:
kubectl -n kube-system exec -it etcd-harvester-node02 -- etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt defrag
👍 1
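If the defrag goes through, the NOSPACE alarm usually has to be disarmed explicitly afterwards - something like this (a sketch, reusing the same TLS flags):
kubectl -n kube-system exec -it etcd-harvester-node02 -- etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt alarm disarm
Then re-run the endpoint status check to confirm the alarm is gone.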
f
image.png
b
kubectl get nodes
?
f
image.png
this is the thing that confuses me
b
OK, here's the thing. I would have reached out and paid SUSE for support a long time ago, but I'm guessing that's not an option for you? That being said, I think that at this point it might be safe to stop the upgrade. https://docs.harvesterhci.io/v1.5/upgrade/troubleshooting/#stop-the-ongoing-upgrade
❤️ 1
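If I'm reading that page right, it basically comes down to deleting the Upgrade CR - roughly (a sketch; <upgrade-name> is whatever the first command returns, double-check the doc first):
# find the ongoing upgrade
kubectl -n harvester-system get upgrades
# delete it to stop the upgrade
kubectl -n harvester-system delete upgrade <upgrade-name>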
I'd cancel the upgrade and try to get the cluster into a healthy state before trying again.
It's not good that there's no etcd scheduled on node03
What I'd probably try (you may/probably want a second opinion here) is this - rough kubectl commands for the node steps are below:
• Delete the upgrade.
• Uncordon node01.
• Reboot node03 and see if etcd comes back.
• If it doesn't come back, cordon node03.
• Boot all my VMs and make sure they're running on nodes 1 & 2.
• Delete node03 from the cluster.
• Re-install node03 from the 1.4.1 ISO.
• Verify etcd is happy/healthy.
• Take a week or two off and plan the upgrade after 1.4.3 comes out (you might be able to jump right to it).
❤️ 1
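Something like this for the node steps (a sketch - node names assumed from your cluster, and Harvester normally has its own host-removal flow in the UI, so double-check before running anything):
kubectl uncordon harvester-node01
# if etcd doesn't come back on node03 after the reboot, keep it out of rotation
kubectl cordon harvester-node03
# once the VMs are confirmed running on nodes 1 & 2, remove node03
kubectl drain harvester-node03 --ignore-daemonsets --delete-emptydir-data
kubectl delete node harvester-node03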
f
Thank you so much for all your support we truly appreciate it. As a small startup, we’re currently limited in resources and unfortunately don’t have the capacity to manage paid support at the moment. Right now, our main goal is to get our services back online by restoring the cluster to version 1.4.1 and ensuring everything is up and running again. Thanks again for your help
thanks again @bland-article-62755
❤️ 1
image.png
Can't delete the upgrade anymore.
r
Well, sorry for the late reply, but after checking the support bundle you provided, it seems that your cluster is in bad shape.
Did you ever try to upgrade RKE2 without using the Harvester upgrade? Something like the steps in this doc?
I saw there are two Plan CRs:
rke2-master-plan
and
rke2-worker-plan
you can check it out with the command
kubectl -n cattle-system get plans
These are not something created by the Harvester upgrade procedure, and the creation timestamps suggest they were created a long time ago.
Besides that, the image-preloading Plan CR looks pretty weird; I've never seen this before:
$ kubectl -n cattle-system get plans hvst-upgrade-grcnc-prepare -o yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  annotations:
    sim.harvesterhci.io/creationTimestamp: "2025-05-25T12:00:39Z"
  creationTimestamp: "2025-05-25T12:00:39Z"
  generation: 1
  labels:
    harvesterhci.io/upgrade: hvst-upgrade-grcnc
    harvesterhci.io/upgradeComponent: node
  name: hvst-upgrade-grcnc-prepare
  namespace: cattle-system
  resourceVersion: "1910"
  uid: 20f139cc-1370-41f2-92b2-e3454dead2a7
spec:
  concurrency: 1
  jobActiveDeadlineSecs: 3600
  nodeSelector:
    matchExpressions:
    - key: upgrade.cattle.io/disable
      operator: Exists
    - key: upgrade.cattle.io/disable
      operator: Exists
    - key: upgrade.cattle.io/disable
      operator: Exists
    - key: upgrade.cattle.io/disable
      operator: Exists
    ... the list goes on and on ...
Also, the Upgrade CR does not look right. The nodeStatuses field only contains the third node:
❯ kubectl -n harvester-system get upgrades hvst-upgrade-grcnc -o yaml
apiVersion: harvesterhci.io/v1beta1
kind: Upgrade
metadata:
  annotations:
    harvesterhci.io/auto-cleanup-system-generated-snapshot: "true"
    harvesterhci.io/replica-replenishment-wait-interval: "600"
    sim.harvesterhci.io/creationTimestamp: "2025-05-25T11:22:06Z"
  creationTimestamp: "2025-05-25T11:22:06Z"
  finalizers:
  - wrangler.cattle.io/harvester-upgrade-controller
  generateName: hvst-upgrade-
  generation: 1
  labels:
    harvesterhci.io/latestUpgrade: "true"
    harvesterhci.io/upgradeState: UpgradingNodes
  name: hvst-upgrade-grcnc
  namespace: harvester-system
  resourceVersion: "2387"
  uid: 4edce737-efd1-4000-aa75-5066137c1e0d
spec:
  logEnabled: true
  version: v1.4.2
status:
  conditions:
  - status: Unknown
    type: Completed
  - status: "True"
    type: LogReady
  - status: "True"
    type: ImageReady
  - status: "True"
    type: RepoReady
  - lastUpdateTime: "2025-05-25T22:13:13Z"
    status: "True"
    type: NodesPrepared
  - status: "True"
    type: SystemServicesUpgraded
  - status: Unknown
    type: NodesUpgraded
  imageID: harvester-system/hvst-upgrade-grcnc
  nodeStatuses:
    harvester-node03:
      state: Images preloading
  previousVersion: v1.4.1
  repoInfo: |
    release:
      harvester: v1.4.2
      harvesterChart: 1.4.2
      os: Harvester v1.4.2
      kubernetes: v1.31.4+rke2r1
      rancher: v2.10.1
      monitoringChart: 103.1.1+up45.31.1
      minUpgradableVersion: v1.4.1
  upgradeLog: hvst-upgrade-grcnc-upgradelog
The upgrade process was stuck because of this. But the root cause is hard to pin down, because there are many abnormal things going on at once.
f
we nuked the cluster
💥 1
🦜 1