
acoustic-addition-45641

04/27/2023, 5:55 PM
Upgrading from v1.1.1 to v1.1.2, and my Harvester cluster (3xDell R740XDs) has been in this state for a few hours. For anyone who has updated to v1.1.2, how long did your update process take?

better-garage-30620

04/27/2023, 8:49 PM
(3x HP DL380 G8) I shut everything down first. It took two tries: the first failed after ~15 minutes, and the second worked after about an hour and a half.

acoustic-addition-45641

04/27/2023, 8:57 PM
Thanks! I shut down all VMs first and verified NTP. No failures yet. Looking at the underlying Rancher dashboard, the upgrade manifest shows 7 hours and still running. One concerning item is that the corresponding upgrade shows "Activating", but the same upgrade's YAML says there is an error without providing any details. I'm not sure how to back out and start over.
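In case anyone else hits this: a rough sketch of checking what the upgrade controller itself reports, rather than the dashboard, assuming the default harvester-system namespace (hvst-upgrade-xxxxx is a placeholder for whatever name the first command lists):
# List in-flight Harvester upgrades, then dump the stuck one's status and conditions.
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n harvester-system describe upgrades.harvesterhci.io hvst-upgrade-xxxxx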

prehistoric-balloon-31801

04/28/2023, 3:56 AM
The log of the apply-manifests job shows what it's waiting for; please see https://docs.harvesterhci.io/v1.1/upgrade/troubleshooting#phase-3-upgrade-system-services
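If it helps, a quick sketch of finding and following that job's log, assuming the usual hvst-upgrade-* job naming in the harvester-system namespace (the exact job name below is a placeholder):
# Find the apply-manifests job for the current upgrade and follow its log.
kubectl -n harvester-system get jobs | grep apply-manifests
kubectl -n harvester-system logs -f jobs/hvst-upgrade-xxxxx-apply-manifests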

sticky-summer-13450

04/28/2023, 12:36 PM
What does one do when the log of the apply-manifests job shows that it's waiting for this:
Waiting for CAPI cluster fleet-local/local to be provisioned (current phase: Provisioning, current generation: 225721)...
as it has for the last 15 hours?

acoustic-addition-45641

04/28/2023, 12:59 PM
Thanks for the link @prehistoric-balloon-31801. For my instance, it appears to be waiting on the monitoring chart update. The log is full of this:
Current version: 100.1.0+up19.0.3, Current state: Modified, Current generation: 5
Sleep for 5 seconds to retry
Current version: 100.1.0+up19.0.3, Current state: Modified, Current generation: 5
Sleep for 5 seconds to retry
Current version: 100.1.0+up19.0.3, Current state: Modified, Current generation: 5
...
And, as it has been in this state for 23 hours, I'll see if deleting the upgrade CR will put the cluster in a better place.
Deleting the upgrade CR allowed me to bring the cluster back online. Again, thanks for the link. I'll try the update again at a future date (or maybe just wait for 1.2.0 release as it is targeted for the end of next month or early June).
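For anyone needing to do the same, a minimal sketch of that step, assuming the default harvester-system namespace (hvst-upgrade-xxxxx is a placeholder for the name shown by the first command; the troubleshooting page linked above covers the full procedure):
# Identify the stuck upgrade CR, then delete it to abort the upgrade.
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n harvester-system delete upgrades.harvesterhci.io hvst-upgrade-xxxxx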

witty-jelly-95845

05/06/2023, 6:35 PM
@sticky-summer-13450 my two-node cluster is doing the same trying to upgrade from 1.1.1 to 1.1.2 - I started this the day 1.1.2 was released, then forgot about it!
I've now logged https://github.com/harvester/harvester/issues/3863 for the 1.1.1 to 1.1.2 upgrade being stuck looping on "Waiting for CAPI cluster fleet-local/local to be provisioned"

sticky-summer-13450

05/06/2023, 7:44 PM
@witty-jelly-95845 my three-node cluster is still in the same state. I've restarted the individual nodes (having set them to maintenance mode first). I've tried following the troubleshooting page to restart the update, but deleting the upgrade.harvesterhci.io/hvst-upgrade-* resource fails with the error:
Error from server (BadRequest): admission webhook "validator.harvesterhci.io" denied the request: cluster fleet-local/local status is provisioning, please wait for it to be provisioned

prehistoric-balloon-31801

05/08/2023, 12:52 AM
@witty-jelly-95845 I’ll check your support bundle today. @sticky-summer-13450 Do you have a support bundle?
If the cluster is stuck at cluster provisioning, it generally means it's waiting for the cluster to be stable, probably due to missing nodes or a previous unfinished upgrade.
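A sketch of the sort of checks that usually narrow that down, assuming the default namespaces used by Harvester and Rancher:
# Look for NotReady or missing nodes, leftover upgrade CRs from a previous
# attempt, and the conditions on the local cluster object Rancher is waiting on.
kubectl get nodes
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n fleet-local get clusters.provisioning.cattle.io local -o yaml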

sticky-summer-13450

05/08/2023, 4:57 PM
@prehistoric-balloon-31801 A support bundle has been created and uploaded to DropBox, and linked to from Simon's ticket above.

prehistoric-balloon-31801

05/09/2023, 3:28 AM
@sticky-summer-13450, both you and Simon hit the same issue. Rancher is waiting for kube-controller-manager and kube-scheduler to be ready. https://github.com/harvester/harvester/issues/3863#issuecomment-1537668466 Simon's is waiting on node dobby; Mark's is waiting on node harvester003.
From the Kubernetes perspective, those pods are running. Can you first help me check if they are running at the node level? Run this on the node I mentioned above:
crictl ps | grep kube-controller-manager
crictl ps | grep kube-scheduler
Update on the ticket: this should be cert expiration: https://github.com/harvester/harvester/issues/3863#issuecomment-1539340681 Have your clusters been running for more than a year?
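If it is easier than the curl probes, a quick sketch for checking those serving certificates directly on the node with openssl (same RKE2 paths as used elsewhere in this thread):
# Print the expiry date of the kube-controller-manager and kube-scheduler certs.
sudo openssl x509 -noout -enddate -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt
sudo openssl x509 -noout -enddate -in /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt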

sticky-summer-13450

05/09/2023, 7:38 AM
For me on the node harvester003 (which I rebooted nearly three days ago) both of those are running:
rancher@harvester003:~> uptime
 07:34:18  up 2 days 16:38,  1 user,  load average: 1.96, 1.87, 1.63

rancher@harvester003:~> sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock ps | grep kube-controller-manager
28e5a38a53d92       38df782a74380       2 days ago          Running             kube-controller-manager         6                   0608ea1d9c747       kube-controller-manager-harvester003

rancher@harvester003:~> sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock ps | grep kube-scheduler
2279ccabaca22       38df782a74380       2 days ago          Running             kube-scheduler                  4                   fe906f7163aa3       kube-scheduler-harvester003
Yes, the Harvester cluster has been up for more than a year - it was created 471 days ago.
Running the suggested test:
(
curl --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt \
  https://127.0.0.1:10257/healthz >/dev/null 2>&1 \
  && echo "[OK] Kube Controller probe" \
  || echo "[FAIL] Kube Controller probe";

curl --cacert /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
  https://127.0.0.1:10259/healthz >/dev/null 2>&1 \
  && echo "[OK] Scheduler probe" \
  || echo "[FAIL] Scheduler probe";
)
results in:
[FAIL] Kube Controller probe
[FAIL] Scheduler probe
So I have run something similar to the suggested fix (I need sudo, the full path to crictl, and the full path to the unix socket):
echo "Rotating kube-controller-manager certificate"
sudo rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock rm -f $(sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock ps -q --name kube-controller-manager)


echo "Rotating kube-scheduler certificate"
sudo rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}
sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock rm -f $(sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock ps -q --name kube-scheduler)
And the upgrade seems to have unstuck itself. Thanks - I'll add the above to the ticket. Hopefully the same will happen for @witty-jelly-95845
After this I had to reboot one of the nodes, which appeared to be stuck at "pre-drained" (which seems to be quite usual for this cluster). But when the node came back, the update process on that node quickly moved into "post-drained", the upgrade moved on, and the cluster is fully updated now - thanks 🙂
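In case it helps anyone else watching a slow node drain, a simple sketch for keeping an eye on progress (the interval is arbitrary):
# Watch node readiness and the upgrade CR state together.
watch -n 30 'kubectl get nodes; kubectl -n harvester-system get upgrades.harvesterhci.io'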

witty-jelly-95845

05/09/2023, 10:50 AM
Snap! My cluster is now upgrading itself after I fixed the certificates - had a bit of a moment when nothing seemed to be happening but now progressing 😄