adamant-kite-43734
01/08/2025, 7:27 PMacoustic-addition-45641
01/08/2025, 7:39 PMminiature-lock-53926
01/08/2025, 7:42 PMminiature-lock-53926
01/08/2025, 7:43 PMminiature-lock-53926
01/08/2025, 7:45 PMminiature-lock-53926
01/08/2025, 7:50 PMacoustic-addition-45641
01/08/2025, 7:50 PMminiature-lock-53926
01/08/2025, 7:52 PMhappy-cat-90847
01/09/2025, 12:56 PMminiature-lock-53926
01/09/2025, 12:58 PMminiature-lock-53926
01/09/2025, 12:59 PMminiature-lock-53926
01/09/2025, 1:00 PMhappy-cat-90847
01/09/2025, 1:20 PMminiature-lock-53926
01/09/2025, 1:24 PMhappy-cat-90847
01/09/2025, 1:46 PMhappy-cat-90847
01/09/2025, 1:47 PMminiature-lock-53926
01/09/2025, 1:49 PMminiature-lock-53926
01/09/2025, 1:49 PMhappy-cat-90847
01/09/2025, 1:50 PMminiature-lock-53926
01/09/2025, 1:50 PMhappy-cat-90847
01/09/2025, 1:50 PMminiature-lock-53926
01/09/2025, 1:51 PMminiature-lock-53926
01/09/2025, 1:57 PMenough-australia-5601
01/09/2025, 2:27 PMWhy are haa-devops-harvester02-node02 and haa-devops-harvester02-node05 cordoned? Was this part of the upgrade or was this done manually?miniature-lock-53926
01/09/2025, 2:28 PMminiature-lock-53926
01/09/2025, 2:31 PMsalmon-city-57654
01/09/2025, 3:16 PMsalmon-city-57654
01/09/2025, 3:16 PMminiature-lock-53926
01/09/2025, 3:17 PMminiature-lock-53926
01/09/2025, 3:17 PMsalmon-city-57654
01/09/2025, 3:18 PMsalmon-city-57654
01/09/2025, 3:20 PMminiature-lock-53926
01/09/2025, 3:20 PMminiature-lock-53926
01/09/2025, 3:20 PMsalmon-city-57654
01/09/2025, 3:21 PMminiature-lock-53926
01/09/2025, 3:22 PMminiature-lock-53926
01/09/2025, 3:23 PMsalmon-city-57654
01/09/2025, 3:24 PMsalmon-city-57654
01/09/2025, 3:24 PMminiature-lock-53926
01/09/2025, 3:31 PMsalmon-city-57654
01/09/2025, 3:33 PMminiature-lock-53926
01/09/2025, 3:33 PMsalmon-city-57654
01/09/2025, 3:54 PMsalmon-city-57654
01/09/2025, 3:54 PMminiature-lock-53926
01/09/2025, 3:57 PMminiature-lock-53926
01/09/2025, 3:58 PMminiature-lock-53926
01/09/2025, 3:58 PMsalmon-city-57654
01/09/2025, 4:01 PMsalmon-city-57654
01/09/2025, 4:02 PMminiature-lock-53926
01/09/2025, 4:02 PMminiature-lock-53926
01/09/2025, 4:02 PMminiature-lock-53926
01/09/2025, 4:04 PMsalmon-city-57654
01/09/2025, 4:05 PMminiature-lock-53926
01/09/2025, 4:06 PMminiature-lock-53926
01/09/2025, 4:08 PMsalmon-city-57654
01/09/2025, 4:09 PMsalmon-city-57654
01/09/2025, 4:10 PMminiature-lock-53926
01/09/2025, 4:12 PMminiature-lock-53926
01/09/2025, 4:14 PMminiature-lock-53926
01/09/2025, 4:19 PMsalmon-city-57654
01/09/2025, 4:19 PMminiature-lock-53926
01/09/2025, 4:22 PMsalmon-city-57654
01/09/2025, 4:23 PMThe running promote pods had specific errors
Did you mean you have logs of the promote pod?
salmon-city-57654
01/09/2025, 4:25 PMminiature-lock-53926
01/09/2025, 4:27 PMsalmon-city-57654
01/09/2025, 4:30 PMsalmon-city-57654
01/09/2025, 4:30 PMsalmon-city-57654
01/09/2025, 4:31 PMminiature-lock-53926
01/09/2025, 4:31 PMminiature-lock-53926
01/09/2025, 4:32 PMDid you manually change anything on Node CR of node5?
not that I remembersalmon-city-57654
01/09/2025, 4:33 PMWaiting for promotion...
That means everything should be settled. We just wait for the status change.miniature-lock-53926
01/09/2025, 4:35 PMsalmon-city-57654
01/09/2025, 4:37 PMminiature-lock-53926
01/09/2025, 4:46 PMsalmon-city-57654
01/09/2025, 4:48 PMacoustic-addition-45641
01/09/2025, 4:52 PMsalmon-city-57654
01/09/2025, 5:00 PMsalmon-city-57654
01/09/2025, 5:00 PMminiature-lock-53926
01/09/2025, 5:01 PMenough-australia-5601
01/09/2025, 5:05 PMespecially because after accidentally triggering the deletion of one of the control-plane nodes
Deleting a single control-plane node in a cluster with three control planes is a really awkward edge case. etcd requires a majority of control-plane nodes to be in good working order to maintain quorum and accept writes; otherwise it falls back into a read-only mode to preserve data consistency and avoid a split-brain scenario. The minimum number of nodes required for etcd cannot be changed willy-nilly; usually that can only be done by backing up the data and restoring into a completely fresh etcd instance. With one out of three control-plane nodes gone, this puts etcd in a weird spot where it requires both other nodes to still be there to maintain quorum.
I found an older Rancher issue about promoting RKE2 workers to master nodes, but none of what's discussed there seems to be out of whack in this case: https://github.com/rancher/rancher/issues/36480#issuecomment-1039253499
I'll need some more time to look for a smoking gun and check back with Vincente before I can advise about what to do next.
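In case it helps while I keep digging, here is a rough way to check the etcd membership and health from one of the surviving control-plane nodes. This is only a sketch; the component=etcd label and the TLS paths are what RKE2 normally uses, so adjust them if they differ on this install:
# list the etcd static pods
kubectl -n kube-system get pods -l component=etcd -o wide
# ask etcd about its members (pick one of the pods listed above; the pod name here is a placeholder)
kubectl -n kube-system exec <etcd-pod-name> -- etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list -w table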
millions-microphone-3535
01/09/2025, 10:56 PMhost05 was scheduled for node promotion, but what I don't get is why (from the kube-controller-manager logs) the promotion job was re-enqueued multiple times (within a 5-minute timeframe) even after a promotion job pod was started.millions-microphone-3535
01/09/2025, 10:56 PMmillions-microphone-3535
01/09/2025, 10:57 PMmillions-microphone-3535
01/09/2025, 11:00 PMharvester-system
namespace and should be named with the prefix harvester-promote-haa-devops-harvester02-host05-
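To see what actually got created there, something like this should do it (just filtering on the prefix above; the exact pod suffix will differ):
kubectl -n harvester-system get jobs,pods | grep harvester-promote-haa-devops-harvester02-host05
# then grab the logs of whichever promote pod shows up (the name here is only a placeholder)
kubectl -n harvester-system logs harvester-promote-haa-devops-harvester02-host05-xxxxx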
millions-microphone-3535
01/09/2025, 11:02 PMminiature-lock-53926
01/10/2025, 10:37 AMminiature-lock-53926
01/10/2025, 10:39 AMminiature-lock-53926
01/10/2025, 10:44 AMkubectl delete upgrade -n cattle-system hvst-upgrade-78k2f-prepare
?enough-australia-5601
01/10/2025, 12:19 PMhost05
is cordoned.
It doesn't seem like it's being drained, but I think this may pose a problem with scheduling the required pods to complete the promotion, because while they may tolerate taints with NoExecute effect, they may not tolerate a taint with NoSchedule effect:
~/Downloads/stuck_upgrade/supportbundle_207a51d7-61ff-4f36-8785-38454b6ce253_2025-01-09T15-25-27Z $ yq '.items[] | {"name": .metadata.name, "taints": .spec.taints}' yamls/cluster/v1/nodes.yaml
name: haa-devops-harvester02-host01
taints: null
name: haa-devops-harvester02-host02
taints:
  - effect: NoSchedule
    key: kubevirt.io/drain
    value: draining
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2025-01-08T21:14:36Z"
name: haa-devops-harvester02-host04
taints: null
name: haa-devops-harvester02-host05
taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2025-01-08T21:10:18Z"
The promotion process specifically looks for a taint with NoSchedule effect, but with a different key:
# make sure we should not have any related label/taint on the node
if [[ $ETCD_ONLY == false ]]; then
  found=$($KUBECTL get node $HOSTNAME -o yaml | $YQ '.spec.taints[] | select (.effect == "NoSchedule" and .key == "node-role.kubernetes.io/etcd=true") | .effect')
  if [[ -n $found ]]
  then
    $KUBECTL taint nodes $HOSTNAME node-role.kubernetes.io/etcd=true:NoExecute-
  fi
  $KUBECTL label --overwrite nodes $HOSTNAME node-role.harvesterhci.io/witness-
fi
While the etcd pods have a toleration for the NoExecute taint, they don't have one for the NoSchedule taint, which is why I think they won't start on a cordoned node and, as a result, a cordoned node won't successfully get promoted.
There is also a Harvester and a Harvester webhook pod which can't be scheduled ever since host05 was cordoned:
...
status:
  conditions:
  - lastProbeTime: "null"
    lastTransitionTime: "2025-01-08T21:10:18Z"
    message: '0/4 nodes are available: 2 node(s) didn''t match pod anti-affinity
      rules, 2 node(s) were unschedulable. preemption: 0/4 nodes are available:
      2 No preemption victims found for incoming pod, 2 Preemption is not helpful
      for scheduling..'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
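For reference, if it turns out the cordon wasn't intentional, uncordoning the node should drop the node.kubernetes.io/unschedulable taint again. Just a sketch, and I wouldn't run it before we understand why the node was cordoned in the first place:
# show the current taints on the node
kubectl get node haa-devops-harvester02-host05 -o jsonpath='{.spec.taints}'
# uncordon clears the unschedulable flag and with it the NoSchedule taint
kubectl uncordon haa-devops-harvester02-host05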
miniature-lock-53926
01/10/2025, 12:28 PMminiature-lock-53926
01/10/2025, 12:30 PMenough-australia-5601
01/10/2025, 12:32 PMharvester-system/harvester-helpers
enough-australia-5601
01/10/2025, 12:34 PMminiature-lock-53926
01/10/2025, 12:34 PMminiature-lock-53926
01/10/2025, 12:35 PMenough-australia-5601
01/10/2025, 12:42 PMThe taint on host05 prevents the pods that make up a Harvester master node from being scheduled. As a result, the node never finishes the promotion. That's why I want to know why the taint was put there in the first place, because then I can maybe tell if it's safe to remove, which would perhaps unblock the promotion.
Right now none of my colleagues are online, but later Ivan will be. I'll ask him what he thinks of this.miniature-lock-53926
01/10/2025, 12:49 PMmillions-microphone-3535
01/10/2025, 6:16 PMminiature-lock-53926
01/10/2025, 6:19 PMminiature-lock-53926
01/10/2025, 6:20 PMkubectl delete upgrade -n cattle-system hvst-upgrade-78k2f-prepare
?millions-microphone-3535
01/10/2025, 6:32 PMmillions-microphone-3535
01/10/2025, 6:34 PMminiature-lock-53926
01/10/2025, 6:36 PMminiature-lock-53926
01/10/2025, 6:38 PMmillions-microphone-3535
01/10/2025, 6:43 PMminiature-lock-53926
01/10/2025, 6:43 PMmillions-microphone-3535
01/10/2025, 6:43 PMminiature-lock-53926
01/10/2025, 6:44 PMminiature-lock-53926
01/10/2025, 7:04 PMminiature-lock-53926
01/10/2025, 7:08 PMUpgrading System Services
because I manually uncordoned one of the nodes, and that let the 3 pending pods schedule successfully for a couple of seconds, which is when the progress bar jumped to 100%.
The mcc-harvester bundle is still not finished though, which is why I assume we may not have really entered Phase 4 yet, or even if we have, nothing has actually been upgraded on the nodes yet.miniature-lock-53926
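For reference, the bundle state can be checked with something like this (assuming the bundles live in the fleet-local namespace, as they normally do on Harvester):
kubectl -n fleet-local get bundles
kubectl -n fleet-local get bundle mcc-harvester -o yaml | yq '.status.summary'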
01/10/2025, 7:10 PMminiature-lock-53926
01/13/2025, 9:24 AMminiature-lock-53926
01/13/2025, 9:40 AMenough-australia-5601
01/13/2025, 10:05 AMrkecontrolplane.rke.cattle.io resource has been deleted
3. If joining another control-plane node, it's unclear (to me) if it should be of version v1.3.1 or v1.3.2 for best chances of success.
I had asked some colleagues from the Rancher team about the rkecontrolplane.rke.cattle.io CRD late on Friday, but I haven't received an answer yet.miniature-lock-53926
01/13/2025, 10:21 AMminiature-lock-53926
01/13/2025, 10:31 AMenough-australia-5601
01/13/2025, 10:41 AM3. Check the states of some Custom Ressources like Machines or RKEControlPlanes`and see that they are stuck in a`Provisioning`or`Reconciling` state .
4. Delete the stuck CRs, which triggers the deletion of one of the control-plane nodes.To me this implied that the
rkecontrolplane
resource was deleted. But you're right, I should have double checked with the support bundle, it's indeed not deleted.miniature-lock-53926
01/13/2025, 10:46 AMminiature-lock-53926
01/13/2025, 2:27 PMenough-australia-5601
01/13/2025, 2:58 PMmachines.cluster.x-k8s.io object belonging to one of the nodes. Then I'm trying to join back a new node in place of the old one. I'm ignoring the upgrade for now to make my test easier to set up.
For a worker node, joining a new node back in place of a deleted machine has worked flawlessly, but for a control-plane node I haven't seen it work well yet. My first attempt wasn't clean, though, as my workstation ran out of memory, so I'll try again.
One thing I already noticed is that if one of the two remaining control-plane nodes experiences any kind of trouble the cluster pretty much immediately becomes inoperable, since the etcd store loses quorum.miniature-lock-53926
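For reference, the machine objects involved can be listed like this (on Harvester they normally live in the fleet-local namespace; the name in the delete command is only a placeholder for whichever machine is being removed):
kubectl -n fleet-local get machines.cluster.x-k8s.io
kubectl -n fleet-local delete machines.cluster.x-k8s.io <machine-name>   # the step being tested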
01/13/2025, 3:02 PMenough-australia-5601
01/13/2025, 3:07 PMminiature-lock-53926
01/13/2025, 3:21 PMminiature-lock-53926
01/14/2025, 8:58 AMenough-australia-5601
01/14/2025, 2:45 PMmachines.cluster.x-k8s.io resource. Once the cluster has finished reconciling (and the node resource is also gone), you can join back a new node under the same name by re-installing on fresh hardware and using the node-join-token. This works for worker nodes as well as control-plane nodes (tested on v1.3.1 with no upgrade running, though).
2. I also tried deleting the machine.cluster.x-k8s.io
resource of a control-plane node during a running upgrade, but I likely did this during a different phase of the upgrade than you. I was able to join the node back using the previously described method of doing a clean install, using the node-join-token to join the node. Once my cluster had 3 control-plane nodes again, the upgrade erred out, but the cluster seemed healthy and I was able to re-start the upgrade. Unfortunately the second attempt at the upgrade didn't succeed (the API server kept crashlooping for ~2h before I pulled the plug on this experiment). During the first upgrade attempt, one of the worker nodes entered a failed state, but I was able to reboot it to get it back to a healthy state. I'm pretty sure this was a resource starvation problem.
Ivan and Alejandro suggested going directly with v1.3.2 when joining the third control-plane node.
You should be able to fetch the node join token out of /etc/rancher/rancherd/config.yaml on one of the existing control-plane nodes.
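Something along these lines should print it on an existing control-plane node (assuming the field in that file is called token):
sudo yq '.token' /etc/rancher/rancherd/config.yaml
# or, without yq:
sudo grep -i token /etc/rancher/rancherd/config.yaml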
I wish you good luck, since I can't give you a guarantee that this will resolve the cluster's problems.miniature-lock-53926
01/15/2025, 8:46 AMminiature-lock-53926
01/15/2025, 10:23 AMminiature-lock-53926
01/15/2025, 10:37 AMenough-australia-5601
01/15/2025, 10:53 AMyip is the cloud-init clone that is used by Elemental, which is the base OS installer used in Harvester: https://github.com/rancher/yip
But I don't think that is really the root problem here. I'm also assuming that you're using network settings that you already know are good.
Are you using a remote-mounted ISO image?miniature-lock-53926
01/15/2025, 10:54 AMminiature-lock-53926
01/15/2025, 10:55 AMminiature-lock-53926
01/15/2025, 10:59 AMenough-australia-5601
01/15/2025, 10:59 AMminiature-lock-53926
01/15/2025, 10:59 AMminiature-lock-53926
01/15/2025, 11:00 AMenough-australia-5601
01/15/2025, 11:00 AMenough-australia-5601
01/15/2025, 11:01 AMminiature-lock-53926
01/15/2025, 11:02 AMenough-australia-5601
01/15/2025, 11:06 AMsupportconfig -k -c
in the installer.enough-australia-5601
01/15/2025, 11:07 AMYou are now trying to install on the same hardware that used to be host03, right?miniature-lock-53926
01/15/2025, 11:07 AMminiature-lock-53926
01/15/2025, 11:09 AMminiature-lock-53926
01/15/2025, 11:11 AMYou are now trying to install on the same hardware that used to be host03, right?
Yes, it is the exact hardware where the old node was running.miniature-lock-53926
01/15/2025, 11:13 AMminiature-lock-53926
01/15/2025, 11:16 AMminiature-lock-53926
01/15/2025, 11:21 AMminiature-lock-53926
01/15/2025, 11:37 AMminiature-lock-53926
01/15/2025, 11:38 AMminiature-lock-53926
01/15/2025, 11:39 AMminiature-lock-53926
01/15/2025, 11:41 AMminiature-lock-53926
01/15/2025, 11:45 AMenough-australia-5601
01/15/2025, 11:53 AMIs nvme9n1 an NVMe-oF device or something like that?
The Harvester installer can be quite tricky if there are things like that floating around, since it mounts partitions by label. Usually it's not a problem, but in some cases the EFI may expose partitions whose labels match.miniature-lock-53926
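One way to rule that out is to list the block devices together with their filesystem labels from the installer shell and see whether anything besides the local disks carries a Harvester-looking label (sketch only):
lsblk -o NAME,SIZE,TYPE,FSTYPE,LABEL,PARTLABEL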
01/15/2025, 12:01 PMminiature-lock-53926
01/15/2025, 12:04 PMminiature-lock-53926
01/15/2025, 12:05 PMminiature-lock-53926
01/15/2025, 12:46 PMenough-australia-5601
01/15/2025, 12:51 PMminiature-lock-53926
01/15/2025, 1:30 PMminiature-lock-53926
01/15/2025, 1:31 PMenough-australia-5601
01/15/2025, 1:33 PMminiature-lock-53926
01/15/2025, 1:33 PMminiature-lock-53926
01/15/2025, 1:34 PMenough-australia-5601
01/15/2025, 1:37 PMkubectl get nodes
?miniature-lock-53926
01/15/2025, 1:37 PMminiature-lock-53926
01/15/2025, 1:37 PMenough-australia-5601
01/15/2025, 1:39 PMminiature-lock-53926
01/15/2025, 1:40 PMminiature-lock-53926
01/15/2025, 1:49 PMenough-australia-5601
01/15/2025, 1:58 PMminiature-lock-53926
01/15/2025, 2:00 PMk get node and k get pod -A --field-selector status.phase!=Running have not shown any change at all, and the kubelet throws the same errors every other minute.enough-australia-5601
01/15/2025, 2:09 PMminiature-lock-53926
01/15/2025, 2:09 PMminiature-lock-53926
01/15/2025, 2:09 PMminiature-lock-53926
01/15/2025, 2:09 PMenough-australia-5601
01/15/2025, 2:10 PMminiature-lock-53926
01/15/2025, 2:13 PMminiature-lock-53926
01/15/2025, 2:16 PMenough-australia-5601
01/15/2025, 2:27 PMv1.27.13 is the RKE2 version that powers Harvester v1.3.1.miniature-lock-53926
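For reference, the version each node is actually reporting can be double-checked with plain kubectl (nothing Harvester-specific here):
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion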
01/15/2025, 2:52 PMenough-australia-5601
01/15/2025, 3:06 PMminiature-lock-53926
01/15/2025, 3:24 PMThe dev/kvm is missing error is just a symptom of that, like with the yip version -g error before.
So we just have to wait for some hands on-site to update the USB stick to 1.3.1 and flash it with Rufus, and then we will continue here tomorrow.
Not great; not terrible ... at least we have not made things worse yet.miniature-lock-53926
01/16/2025, 9:14 AMJan 16 09:12:44 haa-devops-harvester02-host03 rancher-system-agent[4234]: W0116 09:12:44.159850 4234 reflector.go:456] pkg/mod/github.com/rancher/client-go@v1.27.4-rancher1/tools/cache/reflector.go:231: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 29; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
miniature-lock-53926
01/16/2025, 9:15 AMminiature-lock-53926
01/16/2025, 9:22 AMminiature-lock-53926
01/16/2025, 9:29 AMJan 16 09:22:09 haa-devops-harvester02-host03 rancherd[7576]: time="2025-01-16T09:22:09Z" level=info msg="[stdout]: [INFO] Cattle ID was already detected as 429ae3dd34e98159681beb04658a5deda7d408fd1a3c95b1e3924418205c10b. Not generating a new one."
miniature-lock-53926
01/16/2025, 9:29 AMminiature-lock-53926
01/16/2025, 9:40 AMminiature-lock-53926
01/16/2025, 9:44 AMenough-australia-5601
01/16/2025, 10:45 AMWhat does journalctl -u rke2-server.service show? Is that unit even running?miniature-lock-53926
01/16/2025, 10:57 AMminiature-lock-53926
01/16/2025, 1:05 PMenough-australia-5601
01/16/2025, 1:39 PMminiature-lock-53926
01/16/2025, 1:48 PMenough-australia-5601
01/16/2025, 2:01 PMhost03. What about the other hosts? There the management API should be healthy.miniature-lock-53926
01/16/2025, 2:02 PMenough-australia-5601
01/16/2025, 2:16 PMcurl it from host03
miniature-lock-53926
01/16/2025, 2:17 PMminiature-lock-53926
01/16/2025, 2:17 PMenough-australia-5601
01/16/2025, 2:20 PMThat means host03 is able to connect to the API and that the API is there.
So something else failed when re-installing host03, causing it to not be able to join the cluster.miniature-lock-53926
01/16/2025, 2:42 PMminiature-lock-53926
01/16/2025, 2:48 PMenough-australia-5601
01/16/2025, 2:58 PMThe dashboard on the console literally just does a curl on the management URL and checks if the return code of that process indicates success or not.miniature-lock-53926
01/16/2025, 2:59 PMminiature-lock-53926
01/16/2025, 3:00 PMminiature-lock-53926
01/16/2025, 3:11 PMminiature-lock-53926
01/16/2025, 3:15 PMThe dashboard on the console literally just does a curl on the management URL and checks if the return code of that process indicates success or not.
But I still don't understand why the curl then works on the old control-plane nodes like host4 but not on host3, which wants to join.enough-australia-5601
01/16/2025, 3:26 PMThe curl is the last step in a series of checks. These checks all look for objects in the K8s API and work together to make sure the status is displayed correctly whether you're looking at the dashboard on the first node of a cluster or on a worker node, etc.
Not sure why there needs to be another curl request at the end right now.
But if any of these checks fail for any reason, the dashboard will not show the cluster as "Ready".
So in a way it's showing the correct info. host03 isn't ready because it's not properly joined to the cluster, but the others are all showing ready because the cluster is essentially still operating.miniature-lock-53926
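For what it's worth, that final probe can be reproduced by hand from any host with something like the following; the /ping path is what Rancher normally answers on the management URL, so treat it as an assumption, and the VIP placeholder needs to be filled in:
curl -k -w '\n%{http_code}\n' https://<management-vip>/ping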
01/16/2025, 4:05 PMenough-australia-5601
01/16/2025, 4:21 PMminiature-lock-53926
01/16/2025, 4:22 PMminiature-lock-53926
01/16/2025, 4:57 PM