# harvester
a
Depending on how the node was originally set up, you may not be able to. Ref: https://docs.harvesterhci.io/v1.4/host/#role-management • Worker: Restricts a node to being a worker node (never promoted to management node) in a specific cluster. I am interested to see if anyone has found a method for doing this (short of removing the worker node and rebuilding it with the "Manager" role).
m
The problem is that a third control-plane node was originally deployed but was deleted when trying to start an upgrade. The automatic promotion job then ran but did not succeed in promoting a node... Now the upgrade has started but is stuck, because there are only 2 control-plane nodes while it needs to schedule three pods for Rancher, Harvester, and Harvester-Webhook. I am at a loss.
I thought about this and maybe I should add the deleted node back again with the management role... maybe that fixes it
But I think it should be possible to just trigger a manual promotion of one of the remaining nodes... I am remote and cannot easily get the correct installation medium onto the deleted node, otherwise I would have tried this already.
I also just finished writing it up and created an Issue on Github for more clarity https://github.com/harvester/harvester/issues/7331
👍 1
a
Best of luck, and please share findings/lessons learned. I typically deploy Harvester as a 3-node cluster (all with manager roles) in a lab environment, but have been considering moving to 5 nodes with the manager role for stability. The underlying etcd being potentially corrupt or left in a RO state if I lose a node can make for sleepless nights in production. Backing up etcd should help.
๐Ÿ™ 1
m
Thanks. I am hoping Rancherlabs can help here... I really like the whole Rancher, Harvester, RKE2 Ecosystem but sometimes one feels really helpless when the stuff is not working automagically as it usually does
💯 2
h
@miniature-lock-53926 why upgrade when you have a split control plane? The team could help if the promotion failed. You should be able to cancel the upgrade - at what point is it stuck?
m
I really did not want to upgrade but it started by accident while I was preparing the github issue and wanted to take a screenshot of the original error. We were already discussing aborting the upgrade but wanted to wait for a response from Rancherlabs.
I think we are stuck at the end of phase 3
If it is possible to do it safely at this phase, we should abort it here and, as you suggested, first try to get the control plane back up to 3 nodes
h
There will certainly be a need to have a proper cluster. Where exactly is the upgrade stuck?
m
It is stuck trying to schedule a third instance of harvester, harvester-webhook and rancher
h
Ok, I see. We'd have to see what others say. It looks like everything upgraded already. It's just the actual nodes at the moment. Have you tried to join it back in?
Got it.
m
It was host3... we accidentally deleted it while deleting a machine/x-cluster CR that was stuck in a reconciling state, and afterwards we deleted it cleanly from the cluster... it is still there and has Harvester 1.3.1 installed, but it will not rejoin the cluster
We were considering it but are not sure if it will work while the upgrade is running
h
I'd wipe it out and try to join it, but with the version you are upgrading to.
m
and we are still not sure if it is safe to abort the upgrade
h
Yes, it should be. The services were upgraded. It's just the nodes. But perhaps wait for someone to chime in.
m
You just did chime in... but you probably mean someone on the GitHub issue, right?
How would we abort the upgrade?
e
Why are haa-devops-harvester02-node02 and haa-devops-harvester02-node05 cordoned? Was this part of the upgrade or was this done manually?
m
No, this is done by the upgrade. If I uncordon the node manually, it deploys the pending pods for a minute and then cordons the node again and reschedules the pods
That's also why our upgrade is stuck at "Upgrading System Services" showing 100% instead of 50%, even though it has never been able to schedule all pods successfully at least once. That is also why I think that if we could scale the deployments of Rancher, Harvester, and Harvester-Webhook to 2 replicas, the upgrade would also finish successfully. But because the deployments are controlled by Helm and deployed via the mcc-harvester fleet bundle, I cannot permanently scale the deployments to 2 replicas.
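(A minimal sketch for inspecting this, assuming the bundle name mcc-harvester mentioned above lives in the fleet-local namespace and the deployments sit in their usual namespaces; names may differ on this cluster.)
# Show the fleet bundle that owns the Harvester chart; fleet reconciles any manual
# replica change back to the chart values, which matches the behaviour described above.
kubectl get bundle mcc-harvester -n fleet-local -o yaml
# Current replica counts of the three pending deployments (namespaces are assumptions).
kubectl -n harvester-system get deploy harvester harvester-webhook
kubectl -n cattle-system get deploy rancher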
s
Could you generate the SB?
Looks like node2 is stuck on draining?
m
We already have. I sent Rancherlabs the download link to harvester-support-bundle@suse.com today because I could not send the SB via email directly
but I can also upload it here if that is more convenient
s
You can upload SB here. I thought that would be more convenient.
This is the upgrade log. Could you generate a support bundle? REF: https://docs.harvesterhci.io/v1.3/troubleshooting/harvester#generate-a-support-bundle
m
I picked the wrong file, it is already uploading 🙂
👍 1
s
No worries, the upgrade log also helps. But the SB would let us simulate your current environment.
👍 1
m
Ah, I never knew you had like a lab-simulator where you can drop the SB into... nice
This is the supportbundle before the upgrade btw... should I create a new one to get the current state then?
s
Just in my local environment. 😆 https://github.com/rancher/support-bundle-kit could help simulate.
โค๏ธ 1
Sure, please generate a new one to get the current state.
m
That is quite useful. BTW are there any secrets embedded in the SB that should be changed afterwards or is it not necessary?
s
No, we did not collect the secrets
m
s
hmm, did you want to upgrade or drop the upgrade?
Looks like the drain is stuck because the im-pod cannot be drained.
m
If we could drop it safely I think we would prefer it
Then we could fix the control plane and retry it again
What do you mean by im-pod?
s
The instance-manager pod (from Longhorn) was protected by PDB until all replicas on the draining node were moved to other nodes.
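(A hedged sketch of how one could check what is holding the drain, assuming the usual longhorn-system namespace and Longhorn CRD names; the node name is taken from this thread.)
# PodDisruptionBudgets guarding the Longhorn instance-manager pods
kubectl -n longhorn-system get pdb
# Longhorn replicas still located on the node that is being drained
kubectl -n longhorn-system get replicas.longhorn.io -o wide | grep haa-devops-harvester02-node02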
so you want to upgrade from v1.3.1 -> 1.3.2?
m
Ah ok, I did not see that. What would you recommend? Roll back or go forward?
Yes but ultimately to 1.4.0
because of this bug, which we have https://github.com/harvester/harvester/issues/7021
s
Most of the components were upgraded. So, I am not really sure whether dropping the upgrade is good or not.
m
Well, then we would need to shake something loose to get the upgrade rolling again, I guess
I think this is also something that would need to finish successfully for Phase 4 to start, right?
s
Hmm, I don't think this fleet complaint would affect the upgrade at this moment.
But I am worried that because we only have 2 nodes in etcd, once one of the control-plane nodes reboots, we might lose the cluster temporarily (because the etcd cluster might not be alive)
m
yeah, that is why I originally thought that the first priority would be to get one of the 2 worker nodes promoted to control-plane, but I don't know how I would do that... except for retriggering the failed promotion job for host5
Also I don't know if the running upgrade would even allow another node to be promoted at this point
TBH I don't even understand how we ended up in this weird edge-case state. As far as I understood it, it should not even be possible to have an HA cluster running with only 2 control-plane nodes... I know that I am probably at fault for getting us into this edge case, but at the same time it feels like this should be a recoverable state... especially because after the deletion of one of the control-plane nodes was accidentally triggered, a promotion job for another node was already started but just did not succeed, which could probably be a bug in and of itself
s
I am checking the code. IIUC, the promote controller should not be blocked during the upgrade so I am checking.
โค๏ธ 1
m
The running promote pods had specific errors, but we thought maybe those might be "normal" and decided to let it run a little longer. But then those pods were deleted and only the failed job remained, which I did not notice until yesterday when it was too late.
s
The running promote pods had specific errors
Did you mean you have logs of the promote pod?
Oh… I found that one in the old SB
🙌 1
m
That's why I was thinking about just rerunning the promote job, but that is even better
s
it's weird, the corresponding label was not added
From the pod logs, it should be added
Did you manually change anything on Node CR of node5?
m
yeah, our conclusion was that because the x-cluster CRs and the RKEControlPlane "local" had not finished reconciling, this blocked the promotion from completing
Did you manually change anything on Node CR of node5?
not that I remember
s
From the log
Waiting for promotion...
That means everything should be settled. We just wait for the status change.
m
yeah, well, after the promotion pods were no longer running I just assumed that the promotion was successful this time, but host 5 didn't have the right node roles... and that is when I started the write-up of the issue, and while trying to get a screenshot of the original error, this upgrade was triggered accidentally
s
hmm, if the promotion is successful, the promote job should be completed
m
but it did not. Do you think that the Reconciling Cluster CRs were at fault and should I try rerunning the promotion job?
s
I am checking the label, seems some labels were correct
a
@miniature-lock-53926 Reading your comment of, "TBH I don't even understand how we ended up in this weird edge-case state." reminded me that I ran into a similar state that led to my discovering that one of my three Dell R740XD2's had a different processor generation than the other two. This led to the upgrade VM being unable to migrate to the server with the different processor and resulted in a hung upgrade state. Ref: https://github.com/harvester/harvester/issues/7096 Not saying that this is your issue, but something to watch out for (heterogeneous hardware).
โค๏ธ 1
s
I need some time to discuss this situation (means promotion failure). Will update here tomorrow.
โค๏ธ 1
Looks like the labels were correct
m
Perfect. I am also finished with work for today; I will be here again tomorrow morning CET
👍 1
e
especially because after accidentally triggering the deletion of one of the control-plane nodes
Deleting a single control-plane node in a cluster with three control planes is a really awkward edge case. etcd requires the majority of control-plane nodes to be in good working order to maintain quorum and accept writes, otherwise it will fall back into a read-only mode to preserve data consistency and avoid a split-brain scenario. The minimum number of nodes required for etcd can not be changed willy-nilly, usually it can only be done by backing up the data and restoring into a completely fresh etcd instance. With one out of three control-plane nodes gone, this puts the etcd in a weird spot, where it requires both other nodes to still be there to maintain the quorum. I found an older Rancher issue about promoting RKE2 workers to master nodes, but none of what's discussed there seems to be out of whack in this case. https://github.com/rancher/rancher/issues/36480#issuecomment-1039253499 I'll need some more time to look for a smoking gun and check back with Vincente before I can advise about what to do next
โค๏ธ 1
m
@miniature-lock-53926 so from the support bundle, i can see that host05 was scheduled for node promotion, but what i don't get is why (from the kube-controller-manager logs) the promotion job was re-enqueued multiple times (within a 5-minute timeframe) even after a promotion job pod was started
that support bundle didn't have the logs of the promotion job pods so i couldn't tell if they finished, failed, or what
AIUI, the worker-to-control-plane promotion should be automatic, if a control plane node was removed, per https://docs.harvesterhci.io/v1.4/host/#1-check-if-the-node-can-be-removed-from-the-cluster
you mentioned you manually restarted the promotion job.. can you grab the logs from those pods? they will be in the harvester-system namespace and should be named with the prefix harvester-promote-haa-devops-harvester02-host05-
as far as upgrade is concerned, i think, if possible, you should try to add 2 more control plane nodes to the cluster to get the cluster back to a usable state first, before we attempt to suggest any more changes, per https://rancher-users.slack.com/archives/C01GKHKAG0K/p1736430602309099?thread_ts=1736364466.420119&cid=C01GKHKAG0K
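(A small sketch of how those logs could be pulled, using the namespace and pod prefix from the message above; the job name is a placeholder to fill in from the first command.)
# Locate the promotion job and its pods, then dump their logs (if the pods still exist).
kubectl -n harvester-system get jobs | grep promote
kubectl -n harvester-system get pods | grep harvester-promote-haa-devops-harvester02-host05-
kubectl -n harvester-system logs -l job-name=<promote-job-name> --tail=-1   # <promote-job-name>: placeholder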
m
I am actually not 100% sure what the promotion pod did the first time, but I think I remember it had really specific errors that convinced us the promotion was stuck. That is why we activated and deactivated maintenance mode on host 5 in an attempt to retrigger the promotion pod after it got stuck, or (now that I think of it) maybe it was just taking longer than we expected while throwing unrelated/unimportant errors or warnings, and we just thought it was stuck. At that point we did not understand what controls the promotion and were not aware that there was a promotion job in the first place. In any case, that again restarted the promotion process and new promotion pods were created. Afterwards we decided to wait longer this time, and after a while no promotion pod was running anymore, which is why I originally thought the promotion went through this time, until I found the failed promotion job after the upgrade was started.
I was thinking about retriggering the promotion job by just running the failed manifest again to get the promotion logs again, but so far I have not done that because I wanted to get your input first.
So I should NOT just try to install host3 from scratch with version 1.3.2 and try to rejoin the cluster WHILE the upgrade is running, but rather abort the upgrade at this point, which is safe and could also fix the control plane again, or at least make it possible to join a new control-plane node by installing 1.3.2 and joining the cluster, correct? How would I abort the upgrade, is it just
kubectl delete upgrade -n cattle-system hvst-upgrade-78k2f-prepare
?
e
I still don't understand why host05 is cordoned. It doesn't seem like it's being drained, but I think this may pose a problem with scheduling the required pods to complete the promotion, because while they may tolerate taints with NoExecute effect, they may not tolerate a taint with NoSchedule effect:
│ ~/Downloads/stuck_upgrade/supportbundle_207a51d7-61ff-4f36-8785-38454b6ce253_2025-01-09T15-25-27Z │ 130 ► yq '.items[] | {"name": .metadata.name, "taints": .spec.taints}' yamls/cluster/v1/nodes.yaml
name: haa-devops-harvester02-host01
taints: null
name: haa-devops-harvester02-host02
taints:
  - effect: NoSchedule
    key: kubevirt.io/drain
    value: draining
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2025-01-08T21:14:36Z"
name: haa-devops-harvester02-host04
taints: null
name: haa-devops-harvester02-host05
taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2025-01-08T21:10:18Z"
The promotion process specifically looks for a taint with NoSchedule effect, but with a different key:
# make sure we should not have any related label/taint on the node
      if [[ $ETCD_ONLY == false ]]; then
        found=$($KUBECTL get node $HOSTNAME -o yaml | $YQ '.spec.taints[] | select (.effect == "NoSchedule" and .key == "node-role.kubernetes.io/etcd=true") | .effect')
        if [[ -n $found ]]
        then
          $KUBECTL taint nodes $HOSTNAME node-role.kubernetes.io/etcd=true:NoExecute-
        fi
        $KUBECTL label --overwrite nodes $HOSTNAME node-role.harvesterhci.io/witness-
      fi
While the etcd pods have a toleration for the NoExecute taint, they don't have one for the NoSchedule taint, which is why I think they won't start on a cordoned node and, as a result, a cordoned node won't successfully get promoted. There is also a Harvester and a Harvester webhook pod which can't be scheduled ever since host05 was cordoned:
...
  status:
    conditions:
    - lastProbeTime: "null"
      lastTransitionTime: "2025-01-08T21:10:18Z"
      message: '0/4 nodes are available: 2 node(s) didn''t match pod anti-affinity
        rules, 2 node(s) were unschedulable. preemption: 0/4 nodes are available:
        2 No preemption victims found for incoming pod, 2 Preemption is not helpful
        for scheduling..'
      reason: Unschedulable
      status: "False"
      type: PodScheduled
m
Yeah, right. But I thought that canceling the upgrade should uncordon host5 again and then the promotion could/should finish, or am I missing something here? The problem is also that the 3rd control-plane node, host3, was removed from the cluster on Tuesday 07.01 at around 4 PM (CET), but the promotion had not finished successfully by Wednesday 08.01 at around the same time. That was when we saw the promotion pod, thought it was stuck, and apparently "retriggered" it by turning maintenance mode on and off again on host5.
But if we could get to this state again, by which I mean we stop the upgrade and somehow get the promotion rolling again: even if there are errors that prevent it from finishing, we would have more information about what was going wrong with the promotion in the first place, am I right?
e
I don't think we're going to see much. The promotion job runs a script, which you'll find in the configmap harvester-system/harvester-helpers
If you read that script and compare to the logs, it runs through almost all the way - except that for some reason it never finishes waiting at the end. Eventually the job is then killed.
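(A hedged sketch for pulling that script out of the configmap; the exact data key inside the configmap is an assumption, so list the keys first.)
# Show the configmap and its data keys, then print the promotion script
kubectl -n harvester-system get configmap harvester-helpers -o yaml
kubectl -n harvester-system get configmap harvester-helpers -o jsonpath='{.data.promote\.sh}'   # key name: assumption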
m
Yeah, I found that later and was also tempted to just run the script directly on the node or to rerun the failed job again, but have not done either yet
Ah ok and we still don't know what is preventing it from completing, which could be wrong taints/labels?
e
Re-running the script won't do much, since it won't solve the problem of why the finishing condition is never reached. My suspicion is that the taint on host05 prevents the pods that make up a Harvester master node from being scheduled. As a result, the node never finishes the promotion. That's why I want to know why the taint was put there in the first place, because then I can maybe tell if it's safe to remove, which would perhaps unblock the promotion. Right now none of my colleagues are online, but later Ivan will be. I'll ask him what he thinks of this.
👍 1
m
Ok. If you need any more information, want me to get some logs, or want to debug via a screen session, just let me know. I will be available when your colleagues are on again. One last question: would you agree with the emerging consensus that at least the upgrade could be aborted in its current state without making things much worse? And thank you so much to everyone that has helped so far. I am really grateful for all the detailed and knowledgeable feedback from everyone. I really appreciate the effort, even though we dug ourselves into quite a mess 🙂
m
from what we can see, it looks like most of the components were already upgraded. i feel like the least invasive thing to do now is to add new 1.3.2 control plane nodes to the cluster, and let k8s finish scheduling those pending harvester pods.
m
Hi again, thanks for chiming in. But now I have conflicting advice, because @bland-farmer-13503 and @happy-cat-90847 advised against adding the third control-plane node while the upgrade is still running, or did you also mean that I should abort the upgrade first?
And is the correct way to abort/remove an upgrade to just delete the upgrade CR, like kubectl delete upgrade -n cattle-system hvst-upgrade-78k2f-prepare ?
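(For orientation, a hedged sketch: to my understanding the Harvester Upgrade CRs belong to the harvesterhci.io API group and usually live in the harvester-system namespace, so it is worth listing them first to confirm the exact name and namespace before deleting anything; the name below is a placeholder.)
# List all Harvester upgrade CRs across namespaces to confirm name and namespace
kubectl get upgrades.harvesterhci.io -A
# Inspect the stuck one before deciding whether to delete it
kubectl -n harvester-system describe upgrades.harvesterhci.io <upgrade-name>   # <upgrade-name>: placeholder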
m
seems to me that most aspects of the cluster are already upgraded and the cluster is still reachable. the only thing that is missing is the pending harvester pods.
i just saw PoAn's GH comment - maybe he can chime in when he's back online
👍 1
m
I think so too. I am just being hesitant because some of your colleagues think that continuing while the upgrade is still running is a bad idea. I am willing to do either option, but I do not really have time pressure to get it back up ASAP, and that's why I am willing to wait until everybody is on the same page or until the risks of each approach are clearer to me
For that reason I am currently preparing a comprehensive update of the GitHub issue where I try to collect the current findings and discuss the 2 alternatives, and maybe get input from everyone by Monday evening or Tuesday
m
last you posted, your upgrade is already at phase 4, with 'Upgrading System Service' already completed, right?
m
yes
m
this is the current state
Ok, I see. You probably mean that because we have probably entered Phase 4, this warning is relevant, correct?
But I think we are still not completely finished with Phase 3, although we are already at 100% for "Upgrading System Services", because I manually uncordoned one of the nodes and that let the 3 pending pods schedule successfully for just a couple of seconds, which is when the progress bar jumped to 100%. The mcc-harvester bundle is still not finished though, and that is why I assume that we may not have really entered Phase 4 yet, or even if we have, nothing has been upgraded on the nodes so far
In any case, I just finished updating the issue and I am off now. As I said, because the cluster is not really in production yet, I still have time to get a clearer picture, and even if a suggested way would lead to the loss of the cluster, we could still rebuild it without real problems, but we would much rather try to save it. And I still have hope that it is recoverable. Thanks again for all the help so far and have a nice weekend
Hello everybody. I wanted to ask if there are any new ideas or suggestions? Or if it is now clearer what our best course of action is? @bland-farmer-13503 @salmon-city-57654 @enough-australia-5601 @millions-microphone-3535 @happy-cat-90847
I was just thinking it all through again. I am wondering if we should just try to add host3 back again as a new 1.3.2 control-plane node. Because either it will work or it will not, but the worst thing that could happen (in my mind) would probably be that we would have another machine/x-cluster CR stuck in a provisioning state (which was the original problem anyway)
e
Hi staedter, we've been chatting internally about this a bit and the consensus seems to be that in order to get anything moving forward, it's best to get back to three control-plane nodes. However it's a bit unclear how you can get there, because:
1. It's unclear what will happen with the promote job when a third control-plane node is joined
2. It's unclear if the promotion, or a join, will succeed, since the rkecontrolplane.rke.cattle.io resource has been deleted
3. If joining another control-plane node, it's unclear (to me) if it should be of version v1.3.1 or v1.3.2 for best chances of success.
I had asked some colleagues from the Rancher team about the rkecontrolplane.rke.cattle.io CRD late on Friday, but I haven't received an answer yet.
โค๏ธ 1
m
Hi Moritz, thank you for the update. I really appreciate all the timely feedback and your effort helping us. As I said we have the "luxury" to wait a bit more and as long as you are discussing it internally I can and will resist the urge to make it even worse ;)
Just one small question: what makes you say that the rkecontrolplane has been deleted? As far as I understand it, it is still "only" stuck in reconciling, but it is still there, and I don't see that a deletion has been initiated... or maybe I have overlooked that
e
Is it not deleted? I was under the impression that it is. Sorry, this must have been a misunderstanding. From https://github.com/harvester/harvester/issues/7331 the reproduction steps:
3. Check the states of some Custom Resources like Machines or RKEControlPlanes and see that they are stuck in a Provisioning or Reconciling state.
4. Delete the stuck CRs, which triggers the deletion of one of the control-plane nodes.
To me this implied that the rkecontrolplane resource was deleted. But you're right, I should have double-checked with the support bundle, it's indeed not deleted.
m
Oh sorry, I did not realize that I did not specify correctly which CRs I deleted. I will make it clearer in the original issue... I was talking about the stuck machines.x-cluster resources... not even I would have been foolish enough to delete the whole rkecontrolplane of the Harvester cluster xD That happened to me once or twice in a downstream cluster, and I learned the hard way that there is no coming back from this (at least I have not found any way)
@enough-australia-5601 Would that change anything in your opinion if the rkecontrolplane was not deleted?
e
Yes, it does indeed. It lowers my worries that joining another control-plane node may fail quite a bit. At the moment, I'm trying to go through that exact scenario in my virtual dev environment. I have set up a 3 control-plane, 2 worker Harvester v1.3.1 cluster and then deleted the machines.cluster.x-k8s.io object belonging to one of the nodes. Then I'm trying to join back a new node in place of the old one. I'm ignoring the upgrade for now to make my test easier to set up. For a worker node, joining a deleted machine back has worked flawlessly, but for a control-plane node I haven't seen it work well yet. But my first attempt wasn't clean as my workstation ran out of memory, so I'll try again. One thing I already noticed is that if one of the two remaining control-plane nodes experiences any kind of trouble, the cluster pretty much immediately becomes inoperable, since the etcd store loses quorum.
m
Yeah, that would make sense... and that is why it is a good thing that the upgrade stalled where it did and has not yet drained and rebooted one of the control-plane nodes, because then etcd would have become inoperable, right?
e
Yes. And if the etcd becomes inoperable the situation will be a lot less enjoyable than what we have right now.
m
Ok, got it. I will let you test some more. I guess the main question now would be what has better chances of success: adding host3 with 1.3.1 or 1.3.2
👍 1
Good morning. I wanted to ask if there are any new insights or suggestions? I am starting to receive a little internal pressure to get this issue resolved, or at least to provide a rough estimate of when it might be available.
e
Hi, I tried reproducing your situation in my dev environment, but I couldn't get the exact same failure scenario, so I tried some simplified scenarios to see what should work and what certainly won't. Here are some insights:
1. You can remove nodes from a Harvester cluster by simply deleting the machines.cluster.x-k8s.io resource. Once the cluster has finished reconciling (and the node resource is also gone), you can join back a new node under the same name by re-installing on fresh hardware and using the node-join-token. This works both for worker nodes as well as control-plane nodes (tested on v1.3.1 when no upgrade is running though).
2. I also tried deleting the machine.cluster.x-k8s.io resource of a control-plane node during a running upgrade, but I likely did this during a different phase of the upgrade than you. I was able to join the node back using the previously described method of doing a clean install, using the node-join-token to join the node. Once my cluster had 3 control-plane nodes again, the upgrade erred out, but the cluster seemed healthy and I was able to re-start the upgrade. Unfortunately the second attempt at the upgrade didn't succeed (the API server kept crashlooping for ~2h before I pulled the plug on this experiment). During the first upgrade attempt, one of the worker nodes entered a failed state, but I was able to reboot it to get it back to a healthy state. I'm pretty sure this was a resource starvation problem.
Ivan and Alejandro suggested, when joining the third control-plane node, to go directly with v1.3.2. You should be able to fetch the node join token out of /etc/rancher/rancherd/config.yaml on one of the existing control-plane nodes. I wish you good luck, since I can't give you a guarantee that this will resolve the cluster's problems.
โค๏ธ 1
m
Thank you. We are now getting ready to try this and to get some of the more important test data backed up, and then we will try to rejoin host 3 with version 1.3.2. I will keep you updated here and in the GitHub issue if we encounter more problems or if the plan works out.
Ok, now apparently we have another problem just getting the installation ISO to run properly. No matter what network settings we try to configure, we are getting this error. Any idea what this new problem is about? We made sure that the settings are correct and are the same ones we have saved from the node beforehand, from the 90_custom.yaml and also harvester.config. What is 'yip' in the first place, and what could be the problem here?
@enough-australia-5601 Do you have any idea? We have never seen this before (and we have installed from ISO at least 10-15 times already), and our only idea would be that somehow the installation medium is corrupt yet still booted... We are off to a bad start already 😞
e
yip is the cloud-init clone that is used by Elemental, which is the base OS installer used in Harvester: https://github.com/rancher/yip But I don't think that is really the root problem here. I'm also assuming that you're using network settings that you already know are good. Are you using a remote-mounted ISO image?
m
We have just retried it with a remote-mounted image and got the same error. We had problems with the remote media in the past, that is why we switched to USB sticks.
And we wanted to make sure that the USB stick was not corrupted, and we would have asked our admins to flash a new one if the installation via remote media had progressed further, but still the same issue appeared.
The network config is correct, we double-checked and also tried other configurations... still always the same error... the only idea I just thought about would be to wipe the whole installation partition beforehand, before installing again. I thought that was not necessary because AFAIK the installation will wipe the partition before installing anyway, but maybe we could try that.
e
Remote mounted media can show this kind of issue, if it times out.
m
we are currently trying another version of the installation medium remote mounted just to see if the issue persists there...
Ah ok, good to know... but this should not happen with a local USB stick, right?
e
And no, the installation will not necessarily wipe the partition table. It's optional, and IIRC only some of the more recent installers support it at all.
Local USB sticks should work.
m
I actually would rather not wipe the whole disk, because we used 2.4 TB of this disk as the default Longhorn disk... most of our data is on extra disks that are served by a custom storage class, but there might still be data on that partition. But we will try this too if it gets the installation rolling.
e
If the issue persists with USB sticks, I'd first try to find out what exactly the error is. You can log into the running installer image: https://docs.harvesterhci.io/v1.3/troubleshooting/index/#logging-into-the-harvester-installer-a-live-os Then check the usual places for logs etc. If all else fails, you can generate a tarball with debug info with:
supportconfig -k -c
in the installer.
You are now trying to install on the same hardware that used to be host03, right?
m
I just double-checked; unfortunately we included the default disks in our storage class. We can still wipe the disk... then Longhorn has to rebuild the missing replicas, can't be helped...
ah ok, thanks for the how-to, we will try to find the logs and the error... where would the Harvester installation logs be located? We didn't find anything under /var/log/
You are now trying to install on the same hardware that used to be host03, right?
Yes it is the exact hardware where the old node was running
It is also still there and even boots successfully but is then running as a single-node
That's why maybe we should really first delete the data from the installation partition to make sure the old installation is not causing any problems... although it should not
The issue persists even with the usb-stick... we are trying to find the logs now and will be creating a supportconfig tarball
👍 1
Ah ok... we previously also had a small disk as a boot medium, and we are now seeing that, for whatever reason, this disk with the old installation medium is mounted at /run/initramfs/live... which should not be the case...
But we are definitely selecting the USB stick to boot from in the BBS menu, and in the resulting GRUB selection we selected Harvester 1.3.2.
We will try to delete this old installation medium and retry again... that would also explain why this problem persists with different installation media... but that in and of itself seems to be a bug to me.
Maybe the USB stick was not properly flashed and that somehow results in this weird behavior, but that still makes no sense to us.
Weird, but that would explain this problem at least.
e
Weird. Is nvme9n1 an NVMe-oF device or something like that? The Harvester installer can be quite tricky if there are things like that floating around, since it mounts partitions by label. Usually it's not a problem, but in some cases the EFI may expose partitions whose labels match.
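(A hedged sketch for spotting such label clashes from the installer's live shell; the COS_* label names are the usual Elemental ones and are an assumption here.)
# List block devices with their filesystem labels to spot a stale Harvester/Elemental install
lsblk -o NAME,SIZE,TYPE,LABEL,MOUNTPOINT
# Look specifically for duplicate Elemental labels (e.g. COS_STATE, COS_OEM, COS_PERSISTENT)
blkid | grep -i cos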
m
Yes we think that might have been the problem now and also might have explained other weird phenomena in the past
We will wipe this disk on all other nodes too... now the installation from USB also takes a lot longer and finally failed to boot with this error.
Probably because the medium is not properly flashed, as we first suspected, but then we had the problem that the installation process found the old disk by label... yeah, another weird edge case... is there an award for that?
Ok, now we have finally booted into the right ISO and the installation via remote medium is running (we had no one on-site to flash a new USB stick and we just hope our VPNs will stay up).
e
Crossing my fingers for you 🤞
โค๏ธ 1
m
ok, so the installation of 1.3.2 on host3 with a remote-mounted medium was successful and we have now been waiting here for like 5 minutes already
I am wondering if it is a bad sign that the node apparently cannot find the harvester vip
e
Do you see the node in the cluster web UI, or is it not there either?
m
ah ok... we should have used the FQDN, I guess... but it was the same value we had set before in the original harvester.config
The Harvester cluster web GUI is still running, and we were watching the cluster closely via kubeconfig, which is also using the Harvester VIP
e
I mean, does the node show up when you do a kubectl get nodes ?
m
we will try to change the server url in the harvester.config and 90_custom.yaml to match the fqdn
no not yet
e
Looks like going with the FQDN is the way.
👍 1
m
rebooting it now
ok, now the rancher-system-agent has started, but it is taking a while and has errors
e
I wouldn't be worried about seeing those errors a few times. They basically tell you that something that acts like a K8s client failed to watch for a resource, which can happen for a variety of common reasons, and the client should know to handle it and retry. You should see the node joining the cluster
m
How long should we wait before we see anything inside the harvester cluster? The management url is still "Not ready" as before, and the watch on k get node and k get pod -A --field-selector status.phase!=Running has not shown any change at all, and the kubelet throws the same errors every other minute
e
Maybe like five minutes? You did choose to "join an existing Harvester cluster" in the installer, right?
m
You used version 1.3.1 for your tests, right? Maybe that was right after all?
Yes of course 🙂
And used the old join token, which we verified on one of the other nodes
e
Indeed, I only used v1.3.1
m
Probably because the k8s components on the remaining control-plane nodes were not upgraded and are still running older versions
I guess this is not the right k8s version for Harvester 1.3.2 but for 1.3.1, right?
e
Yep. v1.27.13 is the RKE2 version that powers Harvester v1.3.1
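(For a quick cross-check, one can compare the kubelet/RKE2 version each existing node reports; plain kubectl, nothing Harvester-specific assumed.)
# Shows the kubelet version (and thus the RKE2 release) per node
kubectl get nodes -o wide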
m
Ok, we are trying it with 1.3.1 now... because of the remote situation we wanted to try the net-install version, but got a /dev/kvm error that we have not encountered before, and are now trying again with the regular ISO, but then the installation alone takes almost an hour
e
Yeah, unfortunately the installation is anything but quick.
m
Ok, I guess now we have the same problem as before: there is yet again another installation medium (the faulty USB stick that does not boot) that is picked up by the installation routine, so we are not sure which version is actually used during installation, and we assume that the dev/kvm is missing error is just a symptom of that, like with the yip version -g error before - so we just have to wait for some hands on-site to update the USB stick to 1.3.1 and flash it with Rufus, and we will continue here tomorrow. Not great; not terrible... at least we have not made things worse yet
Good morning. We have now booted and installed from a bootable USB stick with version 1.3.1, but we are still seeing the same error in the rancher-system-agent.service
Jan 16 09:12:44 haa-devops-harvester02-host03 rancher-system-agent[4234]: W0116 09:12:44.159850   4234 reflector.go:456] pkg/mod/github.com/rancher/client-go@v1.27.4-rancher1/tools/cache/reflector.go:231: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 29; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
That is also why the rancherd is not progressing further because of
In the Harvester cluster itself there is also no change whatsoever... no sign that there is even an attempt to join
@enough-australia-5601 Could it be that the cluster still somehow remembers the old node and that's why it is not initiating the join? We have found this in the logs
Jan 16 09:22:09 haa-devops-harvester02-host03 rancherd[7576]: time="2025-01-16T09:22:09Z" level=info msg="[stdout]: [INFO]  Cattle ID was already detected as 429ae3dd34e98159681beb04658a5deda7d408fd1a3c95b1e3924418205c10b. Not generating a new one."
Maybe we need to remove that first?
Ah no, sorry... ok that was from a restart of the service after the rancherd was stuck for more than 15 minutes... in the first run it created a new cattle-id
We are wondering if the mismatch between the Rancher version 2.8.3 for 1.3.1 and 2.8.5 for 1.3.2 could also be a problem? Also, we don't really see resources like secret/harvester-cluster-repo in the Harvester cluster
e
Good morning. What does journalctl -u rke2-server.service show? Is that unit even running?
m
no it is not even running
My biggest concern is still that the management URL has never been healthy since we deleted host3, and that is probably why we cannot join a new node... and is also probably the reason the original promotion job has not finished
e
Why would the management URL not be healthy? The API service should be redundant between the control-plane nodes and the ingress should switch to one of the remaining nodes as a backend. Can't you reach the Kubernetes API?
m
We can reach the API but the management api is unhealthy...
e
But this is on host03. What about the other hosts? There the management API should be healthy.
m
ah ok sorry for the misunderstanding... I have to look in the IPMI for another server...
e
You should be able to curl it from host03
m
yeah, but how do I authenticate from it?
I don't have a rke2.yaml yet on that host
e
You don't need to. This shows us that host03 is able to connect to the API and that the API is there. So something else failed when re-installing host03, causing it to not be able to join the cluster.
m
Ok, but from one of the healthy nodes the server dashboard looks like this. What exactly determines if the management URL is ready from a node's perspective?
BTW, we are currently preparing to just install 1.4.0 on the old host3 and bootstrap a whole new cluster while another team runs their long-running tests on the stuck cluster, and after those have finished, just migrate everything over while both clusters are running in parallel.
e
Yeah... I was wondering how much pain you're willing to go through with this cluster, especially since I thought this was an evaluation cluster, not for production workloads. The dashboard on the console literally just does a curl on the management URL and checks if the return code of that process indicates success or not.
m
Well, it was almost production-ready... the last things before our new flagship application was supposed to run there were those tests, which thankfully are still able to run and are producing really promising results... and the upgrade to 1.4.0 xD
And until today I thought a new cluster would mean that we would have to postpone the go-live... which would not have been a good thing to have to report to my higher-ups.
At least we learned a lot and now also know what NOT to do... so if we get a new cluster running by the end of next week and can migrate the application, we should still be on track.
The dashboard on the console literally just does a curl on the management URL and checks if the return code of that process indicates success or not.
But I still don't understand why the curl is then working on the old control-plane nodes like host4 but not on host3, which wants to join
e
I just skimmed the code real quick. The curl is the last step in a series of checks. These checks are all looking for objects in the K8s API. They all work together to make sure that the status is displayed correctly whether you're looking at the dashboard on the first node of a cluster or on a worker node, etc. Not sure why there needs to be another curl request at the end right now. But if any of these checks fail for any reason, the dashboard will not show the cluster as "Ready". So in a way it's showing the correct info. host03 isn't ready because it's not properly joined in the cluster, but the others are all showing ready because the cluster is essentially still operating.
m
yes, that makes sense... but how would a new node check on k8s resources via the API when it cannot authenticate? this is still a mystery to me
e
It just won't be ready until it has finished joining the cluster. Part of the joining process is to configure these authentication credentials.
m
ah ok got it
We were now, for the first time, able to iPXE boot into the Harvester 1.4.0 installation via netboot.xyz... so everything looks promising that we can just set up the new cluster alongside the current cluster, at our own pace and completely remote... also, the product team's test was already successful, and I think we will just try moving forward and try to learn from our mistakes. I am a little disappointed that we were not able to rescue the cluster, because it feels like this state should be recoverable, but it was mostly my fault, and maybe there were already some underlying problems in our cluster that contributed. A fresh install with the newest version gives me a lot more confidence that our applications will run on top without major problems, and that is that. I appreciate all the help from everybody, especially @millions-microphone-3535 and @enough-australia-5601. I will update the GH issue when everything is resolved and try to summarize our findings when I have some more time. Best regards and thank you
👍 1