# harvester
a
Depending on how the node was originally set up, you may not be able to. Ref: https://docs.harvesterhci.io/v1.4/host/#role-management • Worker: Restricts a node to being a worker node (never promoted to management node) in a specific cluster. I am interested to see if anyone has found a method for doing this (short of removing the worker node and rebuilding it with the "Manager" role).
m
The problem is that a third control-plane node was originally deployed but was deleted when trying to start an upgrade. The automatic promotion job then ran but did not succeed in promoting a node... Now the upgrade has started but is stuck, because there are only 2 control-plane nodes while it needs to schedule three pods for Rancher, Harvester, and Harvester-Webhook. I am at a loss.
I thought about this and maybe I should add the deleted node back again with the management role... maybe that fixes it
But I think it should be possible to just trigger a manual promotion of one of the remaining nodes... I am remote and cannot easily get the correct installation medium onto the deleted node, otherwise I would have tried this already.
I also just finished writing it up and created an Issue on Github for more clarity https://github.com/harvester/harvester/issues/7331
👍 1
a
Best of luck, and please share findings/lessons learned. I typically deploy Harvester as a 3-node cluster (all with manager roles) in a lab environment, but have been considering moving to 5 nodes with the manager role for stability. The underlying etcd being potentially corrupt or left in a RO state if I lose a node can make for sleepless nights in production. Backing up etcd should help.
๐Ÿ™ 1
m
Thanks. I am hoping Rancherlabs can help here... I really like the whole Rancher, Harvester, RKE2 Ecosystem but sometimes one feels really helpless when the stuff is not working automagically as it usually does
💯 2
h
@miniature-lock-53926 why upgrade when you have a split control plane? The team could help if the promotion failed. You should be able to cancel the upgrade - at what point is it stuck?
m
I really did not want to upgrade but it started by accident while I was preparing the github issue and wanted to take a screenshot of the original error. We were already discussing aborting the upgrade but wanted to wait for a response from Rancherlabs.
I think we are stuck at the end of phase 3
If it is possible to do it safely at this phase, we should abort it here and, as you suggested, first try to get the control plane back up to 3 nodes
h
There will certainly be a need to have a proper cluster. Where exactly is the upgrade stuck?
m
It is stuck trying to schedule a third instance of harvester, harvester-webhook and rancher
h
Ok, I see. We'd have to see what others say. It looks like everything upgraded already. It's just the actual nodes at the moment. Have you tried to join it back in?
Got it.
m
It was host3... we accidentally deleted it while deleting a machine/x-cluster CR that was stuck in a reconciling state, and afterwards we deleted it cleanly from the cluster... it is still there and has Harvester 1.3.1 installed, but it will not rejoin the cluster
We were considering it but are not sure if it will work while the upgrade is running
h
I'd wipe it out and try to join it, but with the version you are upgrading to.
m
and we are still not sure if it is safe to abort the upgrade
h
Yes, it should be. The services were upgraded. It's just the nodes. But perhaps wait for someone to chime in.
m
You just did chime in... but you probably mean someone on the GitHub issue, right?
How would we abort the upgrade?
e
Why are haa-devops-harvester02-node02 and haa-devops-harvester02-node05 cordoned? Was this part of the upgrade or was this done manually?
m
No, this is done by the upgrade. If I uncordon the node manually, it deploys the pending pods for a minute and then cordons the node again and reschedules the pods
That's also why our upgrade is stuck at "Upgrading System Services" showing 100% instead of 50%, even though it has never been able to schedule all pods successfully at least once. That is also why I think that if we could scale the deployments of Rancher, Harvester, and Harvester-Webhook to 2 replicas, the upgrade would also finish successfully. But because the deployments are controlled by Helm and deployed via the mcc-harvester fleet bundle, I cannot permanently scale the deployments to 2 replicas.
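(A minimal sketch for inspecting this, assuming the bundle name mcc-harvester mentioned above lives in the fleet-local namespace and the deployments sit in their usual namespaces; names may differ on this cluster.)
# Show the fleet bundle that owns the Harvester chart; fleet reconciles any manual
# replica change back to the chart values, which matches the behaviour described above.
kubectl get bundle mcc-harvester -n fleet-local -o yaml
# Current replica counts of the three pending deployments (namespaces are assumptions).
kubectl -n harvester-system get deploy harvester harvester-webhook
kubectl -n cattle-system get deploy rancher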
s
Could you generate the SB?
Looks like node2 is stuck on draining?
m
We already have. I sent Rancherlabs the download link to harvester-support-bundle@suse.com today because I could not send the SB via email directly
but I can also upload it here if that is more convenient
s
You can upload SB here. I thought that would be more convenient.
This is the upgrade log. Could you generate a support bundle? REF: https://docs.harvesterhci.io/v1.3/troubleshooting/harvester#generate-a-support-bundle
m
I picked the wrong file, it is already uploading 🙂
👍 1
s
No worries, the upgrade log also helps. But the SB would let us simulate your current environment.
👍 1
m
Ah, I never knew you had like a lab-simulator where you can drop the SB into... nice
This is the supportbundle before the upgrade btw... should I create a new one to get the current state then?
s
Just in my local environment. 😆 https://github.com/rancher/support-bundle-kit could help simulate.
โค๏ธ 1
Sure, please generate a new one to get the current state.
m
That is quite useful. BTW are there any secrets embedded in the SB that should be changed afterwards or is it not necessary?
s
No, we did not collect the secrets
m
s
hmm, did you want to upgrade or drop the upgrade?
Looks like the drain is stuck because the im-pod cannot be drained.
m
If we could drop it safely I think we would prefer it
Then we could fix the control plane and retry it again
What do you mean by im-pod?
s
The instance-manager pod (from Longhorn) was protected by PDB until all replicas on the draining node were moved to other nodes.
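(A hedged sketch of how one could check what is holding the drain, assuming the usual longhorn-system namespace and Longhorn CRD names; the node name is taken from this thread.)
# PodDisruptionBudgets guarding the Longhorn instance-manager pods
kubectl -n longhorn-system get pdb
# Longhorn replicas still located on the node that is being drained
kubectl -n longhorn-system get replicas.longhorn.io -o wide | grep haa-devops-harvester02-node02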
so you want to upgrade from v1.3.1 -> 1.3.2?
m
Ah ok, I did not see that. What would you recommend? Roll back or go forward?
Yes but ultimately to 1.4.0
because of this bug, which we have https://github.com/harvester/harvester/issues/7021
s
Most of the components were upgraded. So, I am not really sure whether dropping the upgrade is good or not.
m
Well, then we would need to shake something loose to get the upgrade rolling again, I guess
I think this is also something that would need to finish successfully for Phase 4 to start, right?
s
Hmm, I don't think this fleet complaint would affect the upgrade at this moment.
But I am worried that because we only have 2 nodes in etcd, once one of the control-plane nodes reboots, we might lose the cluster temporarily (because the etcd cluster might not be alive)
m
yeah, that is why I originally thought that the first priority would be to get one of the 2 worker nodes promoted to control-plane, but I don't know how I would do that... except for retriggering the failed promotion job for host5
Also I don't know if the running upgrade would even allow another node to be promoted at this point
TBH I don't even understand how we ended up in this weird edge-case state. As far as I understood it, it should not even be possible to have an HA cluster running with only 2 control-plane nodes... I know that I am probably at fault for getting us into this edge case, but at the same time it feels like this should be a recoverable state... especially because after the deletion of one of the control-plane nodes was accidentally triggered, a promotion job for another node was already started but just did not succeed, which could probably be a bug in and of itself
s
I am checking the code. IIUC, the promote controller should not be blocked during the upgrade so I am checking.
โค๏ธ 1
m
The running promote pods had specific errors, but we thought maybe those might be "normal" and decided to let it run a little longer. But then those pods were deleted and only the failed job remained, which I did not notice until yesterday when it was too late.
s
The running promote pods had specific errors
Did you mean you have logs of the promote pod?
Oh… I found that one in the old SB
🙌 1
m
That's why I was thinking about just rerunning the promote job, but that is even better
s
it's weird, the corresponding label was not added
From the pod logs, it should be added
Did you manually change anything on Node CR of node5?
m
yeah, our conclusion was that because the x-cluster CRs and the RKEControlPlane "local" had not finished reconciling, this blocked the promotion from completing
Did you manually change anything on Node CR of node5?
not that I remember
s
From the log
Waiting for promotion...
That means everything should be settled. We just wait for the status change.
m
yeah, well, after the promotion pods were no longer running I just assumed that the promotion was successful this time, but host 5 didn't have the right node roles... and that is when I started the write-up of the issue, and while trying to get a screenshot of the original error, this upgrade was triggered accidentally
s
hmm, if the promotion is successful, the promote job should be completed
m
but it did not. Do you think that the Reconciling Cluster CRs were at fault and should I try rerunning the promotion job?
s
I am checking the label, seems some labels were correct
a
@miniature-lock-53926 Reading your comment of, "TBH I don't even understand how we ended up in this weird edge-case state." reminded me that I ran into a similar state that led to my discovering that one of my three Dell R740XD2's had a different processor generation than the other two. This led to the upgrade VM being unable to migrate to the server with the different processor and resulted in a hung upgrade state. Ref: https://github.com/harvester/harvester/issues/7096 Not saying that this is your issue, but something to watch out for (heterogeneous hardware).
โค๏ธ 1
s
I need some time to discuss this situation (means promotion failure). Will update here tomorrow.
โค๏ธ 1
Looks like the labels were correct
m
Perfect. I am also finished with work for today; I will be here again tomorrow morning CET
👍 1
e
especially because after accidentally triggering the deletion of one of the control-plane nodes
Deleting a single control-plane node in a cluster with three control planes is a really awkward edge case. etcd requires the majority of control-plane nodes to be in good working order to maintain quorum and accept writes, otherwise it will fall back into a read-only mode to preserve data consistency and avoid a split-brain scenario. The minimum number of nodes required for etcd can not be changed willy-nilly, usually it can only be done by backing up the data and restoring into a completely fresh etcd instance. With one out of three control-plane nodes gone, this puts the etcd in a weird spot, where it requires both other nodes to still be there to maintain the quorum. I found an older Rancher issue about promoting RKE2 workers to master nodes, but none of what's discussed there seems to be out of whack in this case. https://github.com/rancher/rancher/issues/36480#issuecomment-1039253499 I'll need some more time to look for a smoking gun and check back with Vincente before I can advise about what to do next
โค๏ธ 1
m
@miniature-lock-53926 so from the support bundle, i can see that host05 was scheduled for node promotion, but what i don't get is why (from the kube-controller-manager logs) the promotion job was re-enqueued multiple times (within a 5-minute timeframe) even after a promotion job pod was started
that support bundle didn't have the logs of the promotion job pods so i couldn't tell if they finished, failed, or what
AIUI, the worker-to-control-plane promotion should be automatic, if a control plane node was removed, per https://docs.harvesterhci.io/v1.4/host/#1-check-if-the-node-can-be-removed-from-the-cluster
you mentioned you manually restarted the promotion job.. can you grab the logs from those pods? they will be in the harvester-system namespace and should be named with the prefix harvester-promote-haa-devops-harvester02-host05-
as far as upgrade is concerned, i think, if possible, you should try to add 2 more control plane nodes to the cluster to get the cluster back to a usable state first, before we attempt to suggest any more changes, per https://rancher-users.slack.com/archives/C01GKHKAG0K/p1736430602309099?thread_ts=1736364466.420119&cid=C01GKHKAG0K
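(A small sketch of how those logs could be pulled, using the namespace and pod prefix from the message above; the job name is a placeholder to fill in from the first command.)
# Locate the promotion job and its pods, then dump their logs (if the pods still exist).
kubectl -n harvester-system get jobs | grep promote
kubectl -n harvester-system get pods | grep harvester-promote-haa-devops-harvester02-host05-
kubectl -n harvester-system logs -l job-name=<promote-job-name> --tail=-1   # <promote-job-name>: placeholder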
m
I am actually not 100% sure what the promotion pod did the first time, but I think I remember it had really specific errors that convinced us the promotion was stuck. That is why we activated and deactivated maintenance mode on host 5 in an attempt to retrigger the promotion pod after it got stuck, or (now that I think of it) maybe it was just taking longer than we expected while throwing unrelated/unimportant errors or warnings, and we just thought it was stuck. At that point we did not understand what controls the promotion and were not aware that there was a promotion job in the first place. In any case, that again restarted the promotion process and new promotion pods were created. Afterwards we decided to wait longer this time, and after a while no promotion pod was running anymore, which is why I originally thought the promotion went through this time, until I found the failed promotion job after the upgrade was started.
I was thinking about retriggering the promotion job by just running the failed manifest again to get the promotion logs again, but so far I have not done that because I wanted to get your input first.
So I should NOT just try to install host3 from scratch with version 1.3.2 and try to rejoin the cluster WHILE the upgrade is running, but rather abort the upgrade at this point, which is safe and could also fix the control plane again, or at least make it possible to join a new control-plane node by installing 1.3.2 and joining the cluster, correct? How would I abort the upgrade, is it just
kubectl delete upgrade -n cattle-system hvst-upgrade-78k2f-prepare
?
e
I still don't understand why host05 is cordoned. It doesn't seem like it's being drained, but I think this may pose a problem with scheduling the required pods to complete the promotion, because while they may tolerate taints with NoExecute effect, they may not tolerate a taint with NoSchedule effect:
│ ~/Downloads/stuck_upgrade/supportbundle_207a51d7-61ff-4f36-8785-38454b6ce253_2025-01-09T15-25-27Z │ 130 ► yq '.items[] | {"name": .metadata.name, "taints": .spec.taints}' yamls/cluster/v1/nodes.yaml
name: haa-devops-harvester02-host01
taints: null
name: haa-devops-harvester02-host02
taints:
  - effect: NoSchedule
    key: kubevirt.io/drain
    value: draining
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2025-01-08T21:14:36Z"
name: haa-devops-harvester02-host04
taints: null
name: haa-devops-harvester02-host05
taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2025-01-08T21:10:18Z"
The promotion process specifically looks for a taint with NoSchedule effect, but with a different key:
# make sure we should not have any related label/taint on the node
      if [[ $ETCD_ONLY == false ]]; then
        found=$($KUBECTL get node $HOSTNAME -o yaml | $YQ '.spec.taints[] | select (.effect == "NoSchedule" and .key == "node-role.kubernetes.io/etcd=true") | .effect')
        if [[ -n $found ]]
        then
          $KUBECTL taint nodes $HOSTNAME node-role.kubernetes.io/etcd=true:NoExecute-
        fi
        $KUBECTL label --overwrite nodes $HOSTNAME node-role.harvesterhci.io/witness-
      fi
While the etcd pods have a toleration for the NoExecute taint, they don't have one for the NoSchedule taint, which is why I think they won't start on a cordoned node and, as a result, a cordoned node won't successfully get promoted. There is also a Harvester and a Harvester webhook pod which can't be scheduled ever since host05 was cordoned:
...
  status:
    conditions:
    - lastProbeTime: "null"
      lastTransitionTime: "2025-01-08T21:10:18Z"
      message: '0/4 nodes are available: 2 node(s) didn''t match pod anti-affinity
        rules, 2 node(s) were unschedulable. preemption: 0/4 nodes are available:
        2 No preemption victims found for incoming pod, 2 Preemption is not helpful
        for scheduling..'
      reason: Unschedulable
      status: "False"
      type: PodScheduled
m
Yeah, right. But I thought that canceling the upgrade should uncordon host5 again and then the promotion could/should finish, or am I missing something here? The problem is also that the 3rd control-plane node, host3, was removed from the cluster on Tuesday 07.01 at around 4 PM (CET), but the promotion had not finished successfully by Wednesday 08.01 at around the same time. That was when we saw the promotion pod, thought it was stuck, and apparently "retriggered" it by turning maintenance mode on and off again on host5.
But if we could get to this state again, by which I mean we stop the upgrade and somehow get the promotion rolling again: even if there are errors that prevent it from finishing, we would have more information about what was going wrong with the promotion in the first place, am I right?
e
I don't think we're going to see much. The promotion job runs a script, which you'll find in the configmap harvester-system/harvester-helpers
If you read that script and compare to the logs, it runs through almost all the way - except that for some reason it never finishes waiting at the end. Eventually the job is then killed.
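(A hedged sketch for pulling that script out of the configmap; the exact data key inside the configmap is an assumption, so list the keys first.)
# Show the configmap and its data keys, then print the promotion script
kubectl -n harvester-system get configmap harvester-helpers -o yaml
kubectl -n harvester-system get configmap harvester-helpers -o jsonpath='{.data.promote\.sh}'   # key name: assumption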
m
Yeah, I found that later and was also tempted to just run the script directly on the node or to rerun the failed job again, but have not done either yet
Ah ok and we still don't know what is preventing it from completing, which could be wrong taints/labels?
e
Re-running the script won't do much, since it won't solve the problem of why the finishing condition is never reached. My suspicion is that the taint on host05 prevents the pods that make up a Harvester master node from being scheduled. As a result, the node never finishes the promotion. That's why I want to know why the taint was put there in the first place, because then I can maybe tell if it's safe to remove, which would perhaps unblock the promotion. Right now none of my colleagues are online, but later Ivan will be. I'll ask him what he thinks of this.
👍 1
m
Ok. If you need any more information, want me to get some logs, or want to debug via a screen session, just let me know. I will be available when your colleagues are on again. One last question: would you agree with the emerging consensus that at least the upgrade could be aborted in its current state without making things much worse? And thank you so much to everyone that has helped so far. I am really grateful for all the detailed and knowledgeable feedback from everyone. I really appreciate the effort, even though we dug ourselves into quite a mess 🙂
m
from what we can see, it looks like most of the components were already upgraded. i feel like the least invasive thing to do now is to add new 1.3.2 control plane nodes to the cluster, and let k8s finish scheduling those pending harvester pods.
m
Hi again, thanks for chiming in. But now I have conflicting advice, because @bland-farmer-13503 and @happy-cat-90847 advised against adding the third control-plane node while the upgrade is still running, or did you also mean that I should abort the upgrade first?
And is the correct way to abort/remove an upgrade to just delete the upgrade CR, like kubectl delete upgrade -n cattle-system hvst-upgrade-78k2f-prepare ?
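(For orientation, a hedged sketch: to my understanding the Harvester Upgrade CRs belong to the harvesterhci.io API group and usually live in the harvester-system namespace, so it is worth listing them first to confirm the exact name and namespace before deleting anything; the name below is a placeholder.)
# List all Harvester upgrade CRs across namespaces to confirm name and namespace
kubectl get upgrades.harvesterhci.io -A
# Inspect the stuck one before deciding whether to delete it
kubectl -n harvester-system describe upgrades.harvesterhci.io <upgrade-name>   # <upgrade-name>: placeholder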
m
seems to me that most aspects of the cluster are already upgraded and the cluster is still reachable. the only thing that is missing is the pending harvester pods.
i just saw PoAn's GH comment - maybe he can chime in when he's back online
👍 1
m
I think so too. I am just being hesitant because some of your colleagues think that continuing while the upgrade is still running is a bad idea. I am willing to do either option, but I do not really have time pressure to get it back up ASAP, and that's why I am willing to wait until everybody is on the same page or until the risks of each approach are clearer to me
For that reason I am currently preparing a comprehensive update of the GitHub issue where I try to collect the current findings and discuss the 2 alternatives, and maybe get input from everyone by Monday evening or Tuesday
m
last you posted, your upgrade is already at phase 4, with 'Upgrading System Service' already completed, right?
m
yes
m
this is the current state
Ok, I see. You probably mean that because we have probably entered Phase 4, this warning is relevant, correct?
But I think we are still not completely finished with Phase 3, although we are already at 100% for "Upgrading System Services", because I manually uncordoned one of the nodes and that let the 3 pending pods schedule successfully for just a couple of seconds, which is when the progress bar jumped to 100%. The mcc-harvester bundle is still not finished though, and that is why I assume that we may not have really entered Phase 4 yet, or even if we have, nothing has been upgraded on the nodes so far
In any case, I just finished updating the issue and I am off now. As I said, because the cluster is not really in production yet, I still have time to get a clearer picture, and even if a suggested way would lead to the loss of the cluster, we could still rebuild it without real problems, but we would much rather try to save it. And I still have hope that it is recoverable. Thanks again for all the help so far and have a nice weekend
Hello everybody. I wanted to ask if there are any new ideas or suggestions? Or if it is now clearer what our best course of action is? @bland-farmer-13503 @salmon-city-57654 @enough-australia-5601 @millions-microphone-3535 @happy-cat-90847
I was just thinking it all through again. I am wondering if we should just try to add host3 back again as a new 1.3.2 control-plane node. Because either it will work or it will not, but the worst thing that could happen (in my mind) would probably be that we would have another machine/x-cluster CR stuck in a provisioning state (which was the original problem anyway)
e
Hi staedter, we've been chatting internally about this a bit and the consensus seems to be that in order to get anything moving forward, it's best to get back to three control-plane nodes. However it's a bit unclear how you can get there, because:
1. It's unclear what will happen with the promote job when a third control-plane node is joined
2. It's unclear if the promotion, or a join, will succeed, since the rkecontrolplane.rke.cattle.io resource has been deleted
3. If joining another control-plane node, it's unclear (to me) if it should be of version v1.3.1 or v1.3.2 for best chances of success.
I had asked some colleagues from the Rancher team about the rkecontrolplane.rke.cattle.io CRD late on Friday, but I haven't received an answer yet.
โค๏ธ 1
m
Hi Moritz, thank you for the update. I really appreciate all the timely feedback and your effort helping us. As I said we have the "luxury" to wait a bit more and as long as you are discussing it internally I can and will resist the urge to make it even worse ;)
Just one small question: what makes you say that the rkecontrolplane has been deleted? As far as I understand it, it is still "only" stuck in reconciling, but it is still there, and I don't see that a deletion has been initiated... or maybe I have overlooked that
e
Is it not deleted? I was under the impression that it is. Sorry, this must have been a misunderstanding. From https://github.com/harvester/harvester/issues/7331 the reproduction steps:
3. Check the states of some Custom Resources like Machines or RKEControlPlanes and see that they are stuck in a Provisioning or Reconciling state.
4. Delete the stuck CRs, which triggers the deletion of one of the control-plane nodes.
To me this implied that the rkecontrolplane resource was deleted. But you're right, I should have double-checked with the support bundle, it's indeed not deleted.
m
Oh sorry, I did not realize that I did not specify correctly which CRs I deleted. I will make it clearer in the original issue... I was talking about the stuck machines.x-cluster resources... not even I would have been foolish enough to delete the whole rkecontrolplane of the Harvester cluster xD That happened to me once or twice in a downstream cluster, and I learned the hard way that there is no coming back from this (at least I have not found any way)
@enough-australia-5601 Would that change anything in your opinion if the rkecontrolplane was not deleted?
e
Yes, it does indeed. It lowers my worries that joining another control-plane node may fail quite a bit. At the moment, I'm trying to go through that exact scenario in my virtual dev environment. I have set up a 3 control-plane, 2 worker Harvester v1.3.1 cluster and then deleted the machines.cluster.x-k8s.io object belonging to one of the nodes. Then I'm trying to join back a new node in place of the old one. I'm ignoring the upgrade for now to make my test easier to set up. For a worker node, joining a deleted machine back has worked flawlessly, but for a control-plane node I haven't seen it work well yet. But my first attempt wasn't clean as my workstation ran out of memory, so I'll try again. One thing I already noticed is that if one of the two remaining control-plane nodes experiences any kind of trouble, the cluster pretty much immediately becomes inoperable, since the etcd store loses quorum.
m
Yeah, that would make sense... and that is why it is a good thing that the upgrade stalled where it did and has not yet drained and rebooted one of the control-plane nodes, because then etcd would have become inoperable, right?
e
Yes. And if the etcd becomes inoperable the situation will be a lot less enjoyable than what we have right now.
m
Ok, got it. I will let you test some more. I guess the main question now would be what has better chances of success: adding host3 with 1.3.1 or 1.3.2
👍 1
Good morning. I wanted to ask if there are any new insights or suggestions? I am starting to receive a little internal pressure to get this issue resolved, or at least to provide a rough estimate of when it might be available.
e
Hi, I tried reproducing your situation in my dev environment, but I couldn't get the exact same failure scenario, so I tried some simplified scenarios to see what should work and what certainly won't. Here are some insights:
1. You can remove nodes from a Harvester cluster by simply deleting the machines.cluster.x-k8s.io resource. Once the cluster has finished reconciling (and the node resource is also gone), you can join back a new node under the same name by re-installing on fresh hardware and using the node-join-token. This works both for worker nodes as well as control-plane nodes (tested on v1.3.1 when no upgrade is running though).
2. I also tried deleting the machine.cluster.x-k8s.io resource of a control-plane node during a running upgrade, but I likely did this during a different phase of the upgrade than you. I was able to join the node back using the previously described method of doing a clean install, using the node-join-token to join the node. Once my cluster had 3 control-plane nodes again, the upgrade erred out, but the cluster seemed healthy and I was able to re-start the upgrade. Unfortunately the second attempt at the upgrade didn't succeed (the API server kept crashlooping for ~2h before I pulled the plug on this experiment). During the first upgrade attempt, one of the worker nodes entered a failed state, but I was able to reboot it to get it back to a healthy state. I'm pretty sure this was a resource starvation problem.
Ivan and Alejandro suggested, when joining the third control-plane node, to go directly with v1.3.2. You should be able to fetch the node join token out of /etc/rancher/rancherd/config.yaml on one of the existing control-plane nodes. I wish you good luck, since I can't give you a guarantee that this will resolve the cluster's problems.
โค๏ธ 1
m
Thank you. We are now getting ready to try this and to get some of the more important test data backed up, and then we will try to rejoin host 3 with version 1.3.2. I will keep you updated here and in the GitHub issue if we encounter more problems or if the plan works out.
Ok, now apparently we have another problem just getting the installation ISO to run properly. No matter what network settings we try to configure, we are getting this error. Any idea what this new problem is about? We made sure that the settings are correct and are the same ones we have saved from the node beforehand, from the 90_custom.yaml and also harvester.config. What is 'yip' in the first place, and what could be the problem here?
@enough-australia-5601 Do you have any idea? We have never seen this before (and we have installed from ISO at least 10-15 times already), and our only idea would be that somehow the installation medium is corrupt yet still booted... We are off to a bad start already 😞
e
yip is the cloud-init clone that is used by Elemental, which is the base OS installer used in Harvester: https://github.com/rancher/yip But I don't think that is really the root problem here. I'm also assuming that you're using network settings that you already know are good. Are you using a remote-mounted ISO image?
m
We have just retried it with a remote-mounted image and got the same error. We had problems with the remote media in the past, that is why we switched to USB sticks.
And we wanted to make sure that the USB stick was not corrupted, and we would have asked our admins to flash a new one if the installation via remote media had progressed further, but still the same issue appeared.
The network config is correct, we double-checked and also tried other configurations... still always the same error... the only idea I just thought about would be to wipe the whole installation partition beforehand, before installing again. I thought that was not necessary because AFAIK the installation will wipe the partition before installing anyway, but maybe we could try that.
e
Remote mounted media can show this kind of issue, if it times out.
m
we are currently trying another version of the installation medium remote mounted just to see if the issue persists there...
Ah ok, good to know... but this should not happen with a local USB stick, right?
e
And no, the installation will not necessarily wipe the partition table. It's optional, and IIRC only some of the more recent installers support it at all.
Local USB sticks should work.
m
I actually would rather not wipe the whole disk, because we used 2.4 TB of this disk as the default Longhorn disk... most of our data is on extra disks that are served by a custom storage class, but there might still be data on that partition. But we will try this too if it gets the installation rolling.
e
If the issue persists with USB sticks, I'd first try to find out what exactly the error is. You can log into the running installer image: https://docs.harvesterhci.io/v1.3/troubleshooting/index/#logging-into-the-harvester-installer-a-live-os Then check the usual places for logs etc. If all else fails, you can generate a tarball with debug info with:
supportconfig -k -c
in the installer.
You are now trying to install on the same hardware that used to be host03, right?
m
I just double-checked; unfortunately we included the default disks in our storage class. We can still wipe the disk... then Longhorn has to rebuild the missing replicas, can't be helped...
ah ok, thanks for the how-to, we will try to find the logs and the error... where would the Harvester installation logs be located? We didn't find anything under /var/log/
You are now trying to install on the same hardware that used to be host03, right?
Yes it is the exact hardware where the old node was running
It is also still there and even boots successfully but is then running as a single-node
That's why maybe we should really first delete the data from the installation partition to make sure the old installation is not causing any problems... although it should not
The issue persists even with the usb-stick... we are trying to find the logs now and will be creating a supportconfig tarball
👍 1
Ah ok... we previously also had a small disk as a boot medium, and we are now seeing that, for whatever reason, this disk with the old installation medium is mounted at /run/initramfs/live... which should not be the case...
But we are definitely selecting the USB stick to boot from in the BBS menu, and in the resulting GRUB selection we selected Harvester 1.3.2.
We will try to delete this old installation medium and retry again... that would also explain why this problem persists with different installation media... but that in and of itself seems to be a bug to me.
Maybe the USB stick was not properly flashed and that somehow results in this weird behavior, but that still makes no sense to us.
Weird, but that would explain this problem at least.
e
Weird. Is nvme9n1 an NVMe-oF device or something like that? The Harvester installer can be quite tricky if there are things like that floating around, since it mounts partitions by label. Usually it's not a problem, but in some cases the EFI may expose partitions whose labels match.
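(A hedged sketch for spotting such label clashes from the installer's live shell; the COS_* label names are the usual Elemental ones and are an assumption here.)
# List block devices with their filesystem labels to spot a stale Harvester/Elemental install
lsblk -o NAME,SIZE,TYPE,LABEL,MOUNTPOINT
# Look specifically for duplicate Elemental labels (e.g. COS_STATE, COS_OEM, COS_PERSISTENT)
blkid | grep -i cos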
m
Yes we think that might have been the problem now and also might have explained other weird phenomena in the past
We will wipe this disk on all other nodes too... now the installation from USB also takes a lot longer and finally failed to boot with this error.
Probably because the medium is not properly flashed, as we first suspected, but then we had the problem that the installation process found the old disk by label... yeah, another weird edge case... is there an award for that?
Ok, now we have finally booted into the right ISO and the installation via remote medium is running (we had no one on-site to flash a new USB stick and we just hope our VPNs will stay up).
e
Crossing my fingers for you 🤞
โค๏ธ 1
m
ok, so the installation of 1.3.2 on host3 with a remote-mounted medium was successful and we have now been waiting here for like 5 minutes already
I am wondering if it is a bad sign that the node apparently cannot find the harvester vip
e
Do you see the node in the cluster web UI, or is it not there either?
m
ah ok... we should have used the FQDN, I guess... but it was the same value we had set before in the original harvester.config
The Harvester cluster web GUI is still running, and we were watching the cluster closely via kubeconfig, which is also using the Harvester VIP
e
I mean, does the node show up when you do a kubectl get nodes ?
m
we will try to change the server url in the harvester.config and 90_custom.yaml to match the fqdn
no not yet
e
Looks like going with the FQDN is the way.
👍 1
m
rebooting it now
ok, now the rancher-system-agent has started, but it is taking a while and has errors
e
I wouldn't be worried about seeing those errors a few times. They basically tell you that something that acts like a K8s client failed to watch for a resource, which can happen for a variety of common reasons, and the client should know to handle it and retry. You should see the node joining the cluster
m
How long should we wait before we see anything inside the harvester cluster? The management url is still "Not ready" as before, and the watch on k get node and k get pod -A --field-selector status.phase!=Running has not shown any change at all, and the kubelet throws the same errors every other minute
e
Maybe like five minutes? You did choose to "join an existing Harvester cluster" in the installer, right?
m
You used version 1.3.1 for your tests, right? Maybe that was right after all?
Yes of course 🙂
And used the old join token, which we verified on one of the other nodes
e
Indeed, I only used v1.3.1
m
Probably because the k8s components on the remaining control-plane nodes were not upgraded and are still running older versions
I guess this is not the right k8s version for Harvester 1.3.2 but for 1.3.1, right?
e
Yep. v1.27.13 is the RKE2 version that powers Harvester v1.3.1
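(For a quick cross-check, one can compare the kubelet/RKE2 version each existing node reports; plain kubectl, nothing Harvester-specific assumed.)
# Shows the kubelet version (and thus the RKE2 release) per node
kubectl get nodes -o wide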
m
Ok, we are trying it with 1.3.1 now... because of the remote situation we wanted to try the net-install version, but got a /dev/kvm error that we have not encountered before, and are now trying again with the regular ISO, but then the installation alone takes almost an hour
e
Yeah, unfortunately the installation is anything but quick.
m
Ok, I guess now we have the same problem as before: there is yet again another installation medium (the faulty USB stick that does not boot) that is picked up by the installation routine, so we are not sure which version is actually used during installation, and we assume that the dev/kvm is missing error is just a symptom of that, like with the yip version -g error before - so we just have to wait for some hands on-site to update the USB stick to 1.3.1 and flash it with Rufus, and we will continue here tomorrow. Not great; not terrible... at least we have not made things worse yet
Good morning. We have now booted and installed from a bootable USB stick with version 1.3.1, but we are still seeing the same error in the rancher-system-agent.service
Jan 16 09:12:44 haa-devops-harvester02-host03 rancher-system-agent[4234]: W0116 09:12:44.159850   4234 reflector.go:456] pkg/mod/github.com/rancher/client-go@v1.27.4-rancher1/tools/cache/reflector.go:231: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 29; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
That is also why the rancherd is not progressing further because of
In the Harvester cluster itself there is also no change whatsoever... no sign that there is even an attempt to join
@enough-australia-5601 Could it be that the cluster still somehow remembers the old node and that's why it is not initiating the join? We have found this in the logs
Jan 16 09:22:09 haa-devops-harvester02-host03 rancherd[7576]: time="2025-01-16T09:22:09Z" level=info msg="[stdout]: [INFO]  Cattle ID was already detected as 429ae3dd34e98159681beb04658a5deda7d408fd1a3c95b1e3924418205c10b. Not generating a new one."
Maybe we need to remove that first?
Ah no, sorry... ok that was from a restart of the service after the rancherd was stuck for more than 15 minutes... in the first run it created a new cattle-id
We are wondering if the mismatch between the Rancher version 2.8.3 for 1.3.1 and 2.8.5 for 1.3.2 could also be a problem? Also, we don't really see resources like secret/harvester-cluster-repo in the Harvester cluster
e
Good morning. What does journalctl -u rke2-server.service show? Is that unit even running?
m
no it is not even running
My biggest concern is still that the management URL has never been healthy since we deleted host3, and that is probably why we cannot join a new node... and is also probably the reason the original promotion job has not finished
e
Why would the management URL not be healthy? The API service should be redundant between the control-plane nodes and the ingress should switch to one of the remaining nodes as a backend. Can't you reach the Kubernetes API?
m
We can reach the API but the management api is unhealthy...
e
But this is on host03. What about the other hosts? There the management API should be healthy.
m
ah ok sorry for the misunderstanding... I have to look in the IPMI for another server...
e
You should be able to curl it from host03
m
yeah, but how do I authenticate from it?
I don't have a rke2.yaml yet on that host
e
You don't need to. This shows us that host03 is able to connect to the API and that the API is there. So something else failed when re-installing host03, causing it to not be able to join the cluster.
m
Ok, but from one of the healthy nodes the server dashboard looks like this. What exactly determines if the management URL is ready from a node's perspective?
BTW, we are currently preparing to just install 1.4.0 on the old host3 and bootstrap a whole new cluster while another team runs their long-running tests on the stuck cluster, and after those have finished, just migrate everything over while both clusters are running in parallel.
e
Yeah... I was wondering how much pain you're willing to go through with this cluster, especially since I thought this was an evaluation cluster, not for production workloads. The dashboard on the console literally just does a curl on the management URL and checks if the return code of that process indicates success or not.
m
Well, it was almost production-ready... the last things before our new flagship application was supposed to run there were those tests, which thankfully are still able to run and are producing really promising results... and the upgrade to 1.4.0 xD
And until today I thought a new cluster would mean that we would have to postpone the go-live... which would not have been a good thing to have to report to my higher-ups.
At least we learned a lot and now also know what NOT to do... so if we get a new cluster running by the end of next week and can migrate the application, we should still be on track.
The dashboard on the console literally just does a curl on the management URL and checks if the return code of that process indicates success or not.
But I still don't understand why the curl is then working on the old control-plane nodes like host4 but not on host3, which wants to join
e
I just skimmed the code real quick. The curl is the last step in a series of checks. These checks are all looking for objects in the K8s API. They all work together to make sure that the status is displayed correctly whether you're looking at the dashboard on the first node of a cluster or on a worker node, etc. Not sure why there needs to be another curl request at the end right now. But if any of these checks fail for any reason, the dashboard will not show the cluster as "Ready". So in a way it's showing the correct info. host03 isn't ready because it's not properly joined in the cluster, but the others are all showing ready because the cluster is essentially still operating.
m
yes, that makes sense... but how would a new node check on k8s resources via the API when it cannot authenticate? this is still a mystery to me
e
It just won't be ready until it has finished joining the cluster. Part of the joining process is to configure these authentication credentials.
m
ah ok got it
We were now, for the first time, able to iPXE boot into the Harvester 1.4.0 installation via netboot.xyz... so everything looks promising that we can just set up the new cluster alongside the current cluster, at our own pace and completely remote... also, the product team's test was already successful, and I think we will just try moving forward and try to learn from our mistakes. I am a little disappointed that we were not able to rescue the cluster, because it feels like this state should be recoverable, but it was mostly my fault, and maybe there were already some underlying problems in our cluster that contributed. A fresh install with the newest version gives me a lot more confidence that our applications will run on top without major problems, and that is that. I appreciate all the help from everybody, especially @millions-microphone-3535 and @enough-australia-5601. I will update the GH issue when everything is resolved and try to summarize our findings when I have some more time. Best regards and thank you
👍 1