# harvester
w
I’ve had to start up the VMs on this cluster. There’s a k8s cluster on it too, and that has now failed - I’m guessing because the infrastructure is a bit broken mid-upgrade. I could afford to lose the k8s cluster, so I decided to delete it and stopped all the VMs to see if this would get things moving… Now I’m stuck with the cluster nodes reporting -
Failed deleting server [fleet-default/web-engine-1-pool1-213e303e-swtv9] of kind (HarvesterMachine) for machine web-engine-1-pool1-5898cb4dcdxbfnxd-xkhrf in infrastructure provider: DeleteError: Downloading driver from https://rancher.web-engineer/assets/docker-machine-driver-harvester
Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
About to remove web-engine-1-pool1-213e303e-swtv9
WARNING: This action will delete both local reference and remote instance.
Error removing host "web-engine-1-pool1-213e303e-swtv9": the server has asked for the client to provide credentials (get virtualmachines.kubevirt.io web-engine-1-pool1-213e303e-swtv9)
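(For anyone hitting the same stuck deletion: a general CAPI/Rancher last-resort technique - not something suggested in this thread - is to clear the finalizers on the stuck machine object so it can be garbage-collected. Only consider this once the backing Harvester VM is confirmed gone, otherwise resources can be orphaned. The machine name below is taken from the error above; a sketch:)

```shell
# Inspect the stuck CAPI machine object on the Rancher/local cluster
kubectl -n fleet-default get machines.cluster.x-k8s.io

# LAST RESORT: clear finalizers so deletion can complete.
# Do this only after confirming the backing VM no longer exists.
kubectl -n fleet-default patch machines.cluster.x-k8s.io \
  web-engine-1-pool1-5898cb4dcdxbfnxd-xkhrf \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```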
Upgrade still stalled… Hoping to resolve this before the week starts. Thinking of rebooting the Harvester nodes next, but going to try looking at these logs more closely first - not sure of the best way forward.
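(One way to look at the upgrade more closely: Harvester tracks upgrade state in an Upgrade custom resource in `harvester-system`. A sketch of where to look - the `<upgrade-name>` placeholder is whatever the first command lists:)

```shell
# List upgrade objects and see which one is in progress
kubectl -n harvester-system get upgrades.harvesterhci.io

# The conditions on the current upgrade usually say which phase is stuck
kubectl -n harvester-system describe upgrades.harvesterhci.io <upgrade-name>

# Pods doing the actual upgrade work, and their recent logs
kubectl -n harvester-system get pods | grep -i upgrade
```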
Following the "Upgrade stuck in the Upgrading System Service state" troubleshooting doc - no node is showing a certificate issue -
➜  Documents kubectl get clusters.provisioning.cattle.io local -n fleet-local -o yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","metadata":{"annotations":{},"name":"local","namespace":"fleet-local"},"spec":{"kubernetesVersion":"v1.25.9+rke2r1","rkeConfig":{"controlPlaneConfig":{"disable":["rke2-snapshot-controller","rke2-snapshot-controller-crd","rke2-snapshot-validation-webhook"]}}}}
    objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yQzU7DMBCEXwXt2Slt079Y4oAQ4sCVF9jYS2Ow15G9CYfK746SVqJC4udo78xovjlBIEGLgqBPgMxRUFzkPD1j+0ZGMskiubgwKOJp4eKts6ChT3F02UV2fKyMH7JQqkwiFAL1ozV+MKXqOL6DhoCMRwrEciUYa3Xz7NjePZwj/8xiDAQafDTo/yXOPZrJAUXB3NdFfnGBsmDoQfPgvQKPLflfR+gwd6Bhu9ztt3XdUGNwc7Crdr9u6jW1y/pg91vb2LXdbHarA6jzYpbSVwho6DCNNIMWBd9Yrtu+eiKpzpeiIPdkpnbzx2Wq+0G6R7Z9dCygT2WSCcpwwciURrJPxJRmZtDLUj4DAAD//5CVWGcAAgAA
    objectset.rio.cattle.io/id: provisioning-cluster-create
    objectset.rio.cattle.io/owner-gvk: management.cattle.io/v3, Kind=Cluster
    objectset.rio.cattle.io/owner-name: local
    objectset.rio.cattle.io/owner-namespace: ""
  creationTimestamp: "2023-10-27T14:18:54Z"
  finalizers:
  - wrangler.cattle.io/provisioning-cluster-remove
  - wrangler.cattle.io/rke-cluster-remove
  generation: 4
  labels:
    objectset.rio.cattle.io/hash: 50675339e9ca48d1b72932eb038d75d9d2d44618
    provider.cattle.io: harvester
  name: local
  namespace: fleet-local
  resourceVersion: "490972466"
  uid: 7aec5041-e2e1-4469-909e-314a62837976
spec:
  kubernetesVersion: v1.25.9+rke2r1
  localClusterAuthEndpoint: {}
  rkeConfig:
    chartValues: null
    machineGlobalConfig: null
    provisionGeneration: 1
    upgradeStrategy:
      controlPlaneDrainOptions:
        deleteEmptyDirData: false
        disableEviction: false
        enabled: false
        force: false
        gracePeriod: 0
        ignoreDaemonSets: null
        postDrainHooks: null
        preDrainHooks: null
        skipWaitForDeleteTimeoutSeconds: 0
        timeout: 0
      workerDrainOptions:
        deleteEmptyDirData: false
        disableEviction: false
        enabled: false
        force: false
        gracePeriod: 0
        ignoreDaemonSets: null
        postDrainHooks: null
        preDrainHooks: null
        skipWaitForDeleteTimeoutSeconds: 0
        timeout: 0
status:
  clientSecretName: local-kubeconfig
  clusterName: local
  conditions:
  - lastUpdateTime: "2023-10-27T14:20:36Z"
    message: marking control plane as initialized and ready
    reason: Waiting
    status: Unknown
    type: Ready
  - lastUpdateTime: "2023-10-27T14:18:54Z"
    status: "False"
    type: Reconciling
  - lastUpdateTime: "2023-10-27T14:18:54Z"
    status: "False"
    type: Stalled
  - lastUpdateTime: "2023-11-11T11:30:32Z"
    status: "True"
    type: Created
  - lastUpdateTime: "2024-08-02T17:20:28Z"
    status: "True"
    type: RKECluster
  - status: Unknown
    type: DefaultProjectCreated
  - status: Unknown
    type: SystemProjectCreated
  - lastUpdateTime: "2023-10-27T14:19:09Z"
    status: "True"
    type: Connected
  - lastUpdateTime: "2024-08-02T10:09:32Z"
    status: "True"
    type: Updated
  - lastUpdateTime: "2024-08-02T10:09:32Z"
    status: "True"
    type: Provisioned
  observedGeneration: 4
  ready: true
The manifest log ends -
Wait for cluster settling down...
Waiting for CAPI cluster fleet-local/local to be provisioned (current phase: Provisioned, current generation: 801608)...
(message repeated 8 times)
CAPI cluster fleet-local/local is provisioned (current generation: 801610).
cluster.fleet.cattle.io/local patched
waiting for fleet-agent creation timestamp to be updated
The "waiting for fleet-agent" message repeats indefinitely.
Checking certs on each of the nodes
n1:/ # (
> curl  --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt \
>   https://127.0.0.1:10257/healthz >/dev/null 2>&1 \
>   && echo "[OK] Kube Controller probe" \
>   || echo "[FAIL] Kube Controller probe";
> 
> curl --cacert /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
>   https://127.0.0.1:10259/healthz >/dev/null 2>&1  \
>   && echo "[OK] Scheduler probe" \
>   || echo "[FAIL] Scheduler probe";
> )
[OK] Kube Controller probe
[OK] Scheduler probe
However n4/n5 do fail!
n5:/ # (
> curl  --cacert /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt \
>   https://127.0.0.1:10257/healthz >/dev/null 2>&1 \
>   && echo "[OK] Kube Controller probe" \
>   || echo "[FAIL] Kube Controller probe";
> 
> curl --cacert /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt \
>   https://127.0.0.1:10259/healthz >/dev/null 2>&1  \
>   && echo "[OK] Scheduler probe" \
>   || echo "[FAIL] Scheduler probe";
> )
[FAIL] Kube Controller probe
[FAIL] Scheduler probe
So it looks like this is the issue…
Ahh - but n4/n5 are workers, not part of the control plane - so I don’t think this is an issue after all.
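(For completeness, a quick way to rule certificates out entirely is to check expiry dates rather than probing endpoints. A minimal sketch - `check_certs` is a made-up helper name, and the default path is the RKE2 server TLS directory used on control-plane nodes:)

```shell
#!/bin/sh
# check_certs: print the notAfter (expiry) date of every .crt under a directory.
check_certs() {
  dir="${1:-/var/lib/rancher/rke2/server/tls}"
  for crt in "$dir"/*.crt "$dir"/*/*.crt; do
    [ -f "$crt" ] || continue
    printf '%s: %s\n' "$crt" "$(openssl x509 -noout -enddate -in "$crt")"
  done
}

check_certs   # prints "<path>: notAfter=<date>" for each cert found
```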
I’ve opened an issue on the GitHub tracker here - https://github.com/harvester/harvester/issues/6255
Support bundle attached - note that cluster management currently lists a broken cluster. That one is likely broken because the upgrade was part complete when I tried to remove it, and it’s disposable. The VMs, however, we would like to retain, to save rebuilding/restoring from backups. Hoping this is something simple that can be solved by configuration? I’ve not yet tried rebooting the nodes, and I’m not sure whether it’s advisable to do an upgrade from ISO instead - or if we should just start from scratch, move directly to the latest version, and restore all VMs from backup… Obviously hoping this can be ironed out with less effort, since it will take many hours to get everything restored if we have to reset and start again. Any pointers greatly appreciated - I’ve tried a few things but I’m running out of ideas.
👀 1
p
gaurav left some comments on the issue, please check.
👍 1
w
updated logs from the upgrade panel - just updating the GitHub issue as I’m not sure we’re winning yet
Posted some further info on the issue - though with my limited knowledge I’m not sure how useful it will be. It would be helpful to know if there is something simple amiss, and whether restarting the process, rebooting the nodes, re-installing via ISO, or starting again is the answer... I can’t do anything major on the cluster until tonight, so I’d appreciate any feedback we can get before we go nuclear - if we can’t find a solution by Friday I’ll be re-installing fresh; hoping we can solve this before then :/
We’re still stuck without a paddle at the moment on this install - @prehistoric-balloon-31801 do you know if it’s safe to restart the upgrade at phase 3 (upgrading system services), since we are stuck there? We’ve removed our k8s cluster and deleted it now, as it was disposable - but the suggestions from your colleague do not appear to resume the install. Appreciate any help here - I’m fast approaching the re-install-and-restore conclusion, but was hoping that wasn’t needed... thanks in advance.
Given everything a reboot - still no movement. Going to have to upgrade manually unless some light is shed.
The reboot made no difference. I created an ISO and attempted an update from USB - but that option doesn’t appear to be present in 1.2.2, yet it’s still in the documentation? Moving toward a full reinstall, which is frustrating.
p
@worried-state-78253 do you have the latest support bundle? The latest one in the thread was captured on Aug 4.
w
Hi - I can add a new one. I’ve just taken the cluster down to 3 nodes and set up a new one alongside on the latest version, but it would be good to understand what’s causing the problem, as that would save migrating VMs and we could pull the resources back into this cluster. The message we get now on upgrade is: "admission webhook "validator.harvesterhci.io" denied the request: managed chart rancher-monitoring-crd is not ready, please wait for it to be ready". Attached the bundle from today -
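(That webhook error suggests inspecting the ManagedChart status directly. A sketch using Rancher’s management.cattle.io resources, assuming the usual Harvester layout where these live in `fleet-local`:)

```shell
# See which managed charts exist and whether they report ready
kubectl -n fleet-local get managedcharts.management.cattle.io

# The status conditions explain why rancher-monitoring-crd isn't ready
kubectl -n fleet-local get managedcharts.management.cattle.io \
  rancher-monitoring-crd -o yaml

# Managed charts are rolled out through Fleet bundles; their status helps too
kubectl -n fleet-local get bundles | grep -i monitoring
```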
Will update the issue to that effect.
Honestly though, if it makes sense for us to restore from backup to the new cluster, we can do that now. I am, however, trying to understand the snags for continuity’s sake.