
acoustic-addition-45641

04/27/2023, 5:55 PM
Upgrading from v1.1.1 to v1.1.2, and my Harvester cluster (3xDell R740XDs) has been in this state for a few hours. For anyone who has updated to v1.1.2, how long did your update process take?

better-garage-30620

04/27/2023, 8:49 PM
(3x HP DL380 G8) I shut everything down first. It took two tries: the first failed after ~15 minutes, and the second worked after about an hour and a half.

acoustic-addition-45641

04/27/2023, 8:57 PM
Thanks! I shut down all VMs first and verified NTP. No failures yet. Looking at the underlying Rancher dashboard, the upgrade manifest shows 7 hours and still running. One concerning item is that the corresponding upgrade shows "Activating", but the same upgrade's YAML says there is an error without providing any details. I'm not sure how to back out and start over.
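In case anyone else hits this: a rough sketch of checking what the upgrade controller itself reports, rather than the dashboard, assuming the default harvester-system namespace (hvst-upgrade-xxxxx is a placeholder for whatever name the first command lists):
# List in-flight Harvester upgrades, then dump the stuck one's status and conditions.
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n harvester-system describe upgrades.harvesterhci.io hvst-upgrade-xxxxx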

prehistoric-balloon-31801

04/28/2023, 3:56 AM
The log of the apply-manifests job shows what it's waiting for; please see https://docs.harvesterhci.io/v1.1/upgrade/troubleshooting#phase-3-upgrade-system-services
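If it helps, a quick sketch of finding and following that job's log, assuming the usual hvst-upgrade-* job naming in the harvester-system namespace (the exact job name below is a placeholder):
# Find the apply-manifests job for the current upgrade and follow its log.
kubectl -n harvester-system get jobs | grep apply-manifests
kubectl -n harvester-system logs -f jobs/hvst-upgrade-xxxxx-apply-manifests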

sticky-summer-13450

04/28/2023, 12:36 PM
What does one do when the log of the apply-manifests job shows that it's waiting for this:
Waiting for CAPI cluster fleet-local/local to be provisioned (current phase: Provisioning, current generation: 225721)...
as it has for the last 15 hours?

acoustic-addition-45641

04/28/2023, 12:59 PM
Thanks for the link @prehistoric-balloon-31801. For my instance, it appears to be waiting on the monitoring chart update. The log is full of this:
Current version: 100.1.0+up19.0.3, Current state: Modified, Current generation: 5
Sleep for 5 seconds to retry
Current version: 100.1.0+up19.0.3, Current state: Modified, Current generation: 5
Sleep for 5 seconds to retry
Current version: 100.1.0+up19.0.3, Current state: Modified, Current generation: 5
...
And, as it has been in this state for 23 hours, I'll see if deleting the upgrade CR will put the cluster in a better place.
Deleting the upgrade CR allowed me to bring the cluster back online. Again, thanks for the link. I'll try the update again at a future date (or maybe just wait for 1.2.0 release as it is targeted for the end of next month or early June).
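For anyone needing to do the same, a minimal sketch of that step, assuming the default harvester-system namespace (hvst-upgrade-xxxxx is a placeholder for the name shown by the first command; the troubleshooting page linked above covers the full procedure):
# Identify the stuck upgrade CR, then delete it to abort the upgrade.
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n harvester-system delete upgrades.harvesterhci.io hvst-upgrade-xxxxx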

witty-jelly-95845

05/06/2023, 6:35 PM
@sticky-summer-13450 my two-node cluster is doing the same trying to upgrade from 1.1.1 to 1.1.2 - I started this the day 1.1.2 was released, then forgot about it!
I've now logged https://github.com/harvester/harvester/issues/3863 for the 1.1.1 to 1.1.2 upgrade being stuck looping on "Waiting for CAPI cluster fleet-local/local to be provisioned"

sticky-summer-13450

05/06/2023, 7:44 PM
@witty-jelly-95845 my three-node cluster is still in the same state. I've restarted the individual nodes (having set them to maintenance mode first). I've tried following the troubleshooting page to restart the update, but deleting the upgrade.harvesterhci.io/hvst-upgrade-* resource fails with the error:
Error from server (BadRequest): admission webhook "validator.harvesterhci.io" denied the request: cluster fleet-local/local status is provisioning, please wait for it to be provisioned

prehistoric-balloon-31801

05/08/2023, 12:52 AM
@witty-jelly-95845 I’ll check your support bundle today. @sticky-summer-13450 Do you have a support bundle?
If the cluster is stuck at cluster provisioning, it generally means it's waiting for the cluster to be stable, probably due to missing nodes or a previous unfinished upgrade.
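A sketch of the sort of checks that usually narrow that down, assuming the default namespaces used by Harvester and Rancher:
# Look for NotReady or missing nodes, leftover upgrade CRs from a previous
# attempt, and the conditions on the local cluster object Rancher is waiting on.
kubectl get nodes
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n fleet-local get clusters.provisioning.cattle.io local -o yaml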

sticky-summer-13450

05/08/2023, 4:57 PM
@prehistoric-balloon-31801 A support bundle has been created and uploaded to DropBox, and linked to from Simon's ticket above.

prehistoric-balloon-31801

05/09/2023, 3:28 AM
@sticky-summer-13450, both you and Simon hit the same issue. Rancher is waiting for kube-controller-manager and kube-scheduler to be ready. https://github.com/harvester/harvester/issues/3863#issuecomment-1537668466 Simon's is waiting on node dobby; Mark's is waiting on node harvester003.
From the Kubernetes perspective, those pods are running. Can you first help me check if they are running at the node level? Run this on the node I mentioned above:
crictl ps | grep kube-controller-manager
crictl ps | grep kube-scheduler
Update on the ticket: this should be cert expiration: https://github.com/harvester/harvester/issues/3863#issuecomment-1539340681 Have your clusters been running for more than a year?
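If it is easier than the curl probes, a quick sketch for checking those serving certificates directly on the node with openssl (same RKE2 paths as used elsewhere in this thread):
# Print the expiry date of the kube-controller-manager and kube-scheduler certs.
sudo openssl x509 -noout -enddate -in /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt
sudo openssl x509 -noout -enddate -in /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt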

sticky-summer-13450

05/09/2023, 7:38 AM
For me on the node harvester003 (which I rebooted nearly three days ago) both of those are running:
rancher@harvester003:~> uptime
 07:34:18  up 2 days 16:38,  1 user,  load average: 1.96, 1.87, 1.63

rancher@harvester003:~> sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock ps | grep kube-controller-manager
28e5a38a53d92       38df782a74380       2 days ago          Running             kube-controller-manager         6                   0608ea1d9c747       kube-controller-manager-harvester003

rancher@harvester003:~> sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock ps | grep kube-scheduler
2279ccabaca22       38df782a74380       2 days ago          Running             kube-scheduler                  4                   fe906f7163aa3       kube-scheduler-harvester003
Yes, the Harvester cluster has been up for more than a year - it was created 471 days ago.
Running the suggested test:
(
curl --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt \
  https://127.0.0.1:10257/healthz >/dev/null 2>&1 \
  && echo "[OK] Kube Controller probe" \
  || echo "[FAIL] Kube Controller probe";

curl --cacert /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
  https://127.0.0.1:10259/healthz >/dev/null 2>&1 \
  && echo "[OK] Scheduler probe" \
  || echo "[FAIL] Scheduler probe";
)
results in:
[FAIL] Kube Controller probe
[FAIL] Scheduler probe
So I have run something similar to the suggested fix (I need sudo, the full path to crictl, and the full path to the unix socket):
echo "Rotating kube-controller-manager certificate"
sudo rm /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.{crt,key}
sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock rm -f $(sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock ps -q --name kube-controller-manager)


echo "Rotating kube-scheduler certificate"
sudo rm /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.{crt,key}
sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock rm -f $(sudo /var/lib/rancher/rke2/bin/crictl -r unix:///var/run/k3s/containerd/containerd.sock ps -q --name kube-scheduler)
And the upgrade seems to have unstuck itself. Thanks - I'll add the above to the ticket. Hopefully the same will happen for @witty-jelly-95845
After this I had to reboot one of the nodes, which appeared to be stuck at "pre-drained" (which seems to be quite usual for this cluster). But when the node came back, the update process on that node quickly moved into "post-drained", the upgrade moved on, and the cluster is fully updated now - thanks 🙂
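In case it helps anyone else watching a slow node drain, a simple sketch for keeping an eye on progress (the interval is arbitrary):
# Watch node readiness and the upgrade CR state together.
watch -n 30 'kubectl get nodes; kubectl -n harvester-system get upgrades.harvesterhci.io'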

witty-jelly-95845

05/09/2023, 10:50 AM
Snap! My cluster is now upgrading itself after I fixed the certificates - had a bit of a moment when nothing seemed to be happening but now progressing 😄