This message was deleted Rancher Users #harvester

Join Slack

This message was deleted.

# harvester

adamant-kite-43734

09/15/2024, 3:20 PM

This message was deleted.

worried-state-78253

09/15/2024, 3:31 PM

I think that the hint here is that n0 isn’t showing as control-plane as it was setup as a witness node on install, this I think is pointing toward an issue with that machine.

worried-state-78253

09/15/2024, 3:38 PM

Looking in Lens at the pods on the harvester cluster I can also see one or two unhealthy pods on n0, further pointing to issues with this machine. The support package pod I can see in the logs looks to hang when it gets to n0 - supportbundle-manager-bundle-XXXX

Copy code

time="2024-09-15T15:28:42Z" level=debug msg="Expecting bundles from nodes: map[n0: n1: n2: n3: n4: n5:]"
time="2024-09-15T15:29:13Z" level=debug msg="Handle create node bundle for n2"
time="2024-09-15T15:29:13Z" level=debug msg="Complete node n2"
time="2024-09-15T15:29:14Z" level=debug msg="Handle create node bundle for n3"
time="2024-09-15T15:29:14Z" level=debug msg="Complete node n3"
time="2024-09-15T15:29:15Z" level=debug msg="Handle create node bundle for n1"
time="2024-09-15T15:29:15Z" level=debug msg="Complete node n1"
time="2024-09-15T15:29:17Z" level=debug msg="Handle create node bundle for n4"
time="2024-09-15T15:29:17Z" level=debug msg="Complete node n4"
time="2024-09-15T15:29:21Z" level=debug msg="Handle create node bundle for n5"
time="2024-09-15T15:29:21Z" level=debug msg="Complete node n5"

If we compare the logs from n0's supportbundle-agent-bundle-XXX vs the others - n0 stops here -

Copy code

+ curl -v -i -H 'Content-Type: application/zip' --data-binary @node_bundle.zip <http://10.52.10.15:8080/nodes/n0>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.52.10.15:8080...

where as the others have gone further -

Copy code

+ curl -v -i -H 'Content-Type: application/zip' --data-binary @node_bundle.zip <http://10.52.10.15:8080/nodes/n1>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.52.10.15:8080...
* Connected to 10.52.10.15 (10.52.10.15) port 8080 (#0)
> POST /nodes/n1 HTTP/1.1
> Host: 10.52.10.15:8080
> User-Agent: curl/8.0.1
> Accept: */*
> Content-Type: application/zip
> Content-Length: 1372048
> Expect: 100-continue
> 
< HTTP/1.1 100 Continue
} [65536 bytes data]
* We are completely uploaded and fine
< HTTP/1.1 201 Created
< Date: Sun, 15 Sep 2024 15:29:15 GMT
< Content-Length: 0
< 

1339k    0     0  100 1339k      0   320M --:--:-- --:--:-- --:--:--  327M
* Connection #0 to host 10.52.10.15 left intact
HTTP/1.1 100 Continue

HTTP/1.1 201 Created
Date: Sun, 15 Sep 2024 15:29:15 GMT
Content-Length: 0

+ sleep infinity

I’m going to remove n0 completely from the cluster and try again!

worried-state-78253

09/15/2024, 3:44 PM

OK - waited until all pods clear of n0 and it looked out of bounds… hitting upgrade produced same result with the error in admission hook. “admission webhook “validator.harvesterhci.io” denied the request: managed chart harvester is not ready, please wait for it to be ready” Support bundle attached.

supportbundle_4d170f3b-38c0-41ba-b05c-dfef3d4a621a_2024-09-15T15-41-01Z.zip

worried-state-78253

09/16/2024, 9:22 AM

Copy code

n1:/ # kubectl get bundle -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             1/1                       
fleet-local   local-managed-system-agent                    1/1                       
fleet-local   mcc-harvester                                 0/1                       Modified(1) [Cluster fleet-local/local]; <http://storageclass.storage.k8s.io|storageclass.storage.k8s.io> harvester-longhorn missing
fleet-local   mcc-harvester-crd                             1/1                       
fleet-local   mcc-local-managed-system-upgrade-controller   1/1                       
fleet-local   mcc-rancher-logging-crd                       1/1                       
fleet-local   mcc-rancher-monitoring-crd                    1/1

Not sure if above has any significance "Modified(1) [Cluster fleet-local/local]; storageclass.storage.k8s.io harvester-longhorn missing" ran on a control plane node.

worried-state-78253

09/16/2024, 9:24 AM

Guess i need to dig into the specifics behind "admission webhook "validator.harvesterhci.io" denied the request: managed chart harvester is not ready, please wait for it to be ready" in the first place... will deep dive again tonight.

enough-australia-5601

09/16/2024, 11:00 AM

Hi Craig, I'm looking at your support bundle and I found a few of these error messages:

Copy code

logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-12T13:11:05.801669145Z E0912 13:11:05.801609       1 scraper.go:149] "Failed to scrape node" err="Get \"<https://192.168.122.134:10250/metrics/resource>\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"                                                                                                                          logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-12T13:11:20.777705961Z E0912 13:11:20.777630       1 scraper.go:149] "Failed to scrape node" err="Get \"<https://192.168.122.134:10250/metrics/resource>\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"                                                                                                                          logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-12T13:11:35.817668323Z E0912 13:11:35.817610       1 scraper.go:149] "Failed to scrape node" err="Get \"<https://192.168.122.134:10250/metrics/resource>\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"                                                                                                                          
[...]
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-15T14:52:20.809677447Z E0915 14:52:20.809617       1 scraper.go:149] "Failed to scrape node" err="Get \"<https://192.168.122.134:10250/metrics/resource>\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"                                                                                                                          logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-15T14:52:35.789668865Z E0915 14:52:35.789605       1 scraper.go:149] "Failed to scrape node" err="Get \"<https://192.168.122.134:10250/metrics/resource>\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"                                                                                                                          logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-15T14:52:50.765663858Z E0915 14:52:50.765595       1 scraper.go:149] "Failed to scrape node" err="Get \"<https://192.168.122.134:10250/metrics/resource>\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"

Seems like your node n0 has somehow disappeared at some point on September 12th? Could you please describe what the underlying hardware of your setup is, and how history with n0 being a witness node went down? I gathered from your earlier messages that you had a two-node setup with n0 as witness node. Did you then first try the upgrade or first add the other nodes?

worried-state-78253

09/16/2024, 11:11 AM

n0 is the node we've removed from the cluster - it was a witness node used initially, I think that machine wasnt great and it clearly got demoted as part of the control-plane as 3 other machines were working as such. Is there a correct way to clean up an old node other than just deleting it from the UI?

worried-state-78253

09/16/2024, 11:13 AM

i'll try reinstalling the metrics perhaps... will do that now as maybe that will sort that specific issue out

enough-australia-5601

09/16/2024, 11:15 AM

I figured that, but I'm still trying to wrap my head around the order in which things happened. n0 was a witness node. I assume n4 and n5 were the two control planes at that point. Then you added more nodes and it seems n1 got promoted to control plane as well. Was this intended or an accident?

worried-state-78253

09/16/2024, 11:18 AM

yes - basically n4/5 were taken from a previous 1.2 cluster and we added the witness node so we could run latest version alongside the old cluster which had failed to upgrade. Once all the workloads were moved over we then reset, upgraded the old hardware and pulled the other nodes across. We also updated the network so they all have a 10GSPF+ for management and one dedicated for storage to help improve the stability of big file operations.

enough-australia-5601

09/16/2024, 11:29 AM

So at one point you had two entirely separate v1.3.1 clusters and tried to join them into one cluster? I'm quite sure this operation isn't supported at all. Or did I misunderstand that?

worried-state-78253

09/16/2024, 11:32 AM

no - you misunderstand - we had a 1.2 cluster and a 1.3 cluster - we decommissioned the 1.2 cluster and moved the workloads over by rebuilding the VM's on the new cluster and syncing as that 1.2 cluster was broken and we ran out of patience given we had enough machines to run 2 harvester clusters at the same time if we used a witness node with the new harvester cluster. Disabling metrics still fails to pass the admission web-hook btw. I guess I need to look at what it's checking for admission... not sure where to look atm.

worried-state-78253

09/16/2024, 11:33 AM

(we also have a dedicated rancher cluster of low powered machines setup in HA for management and registered the harvester setups to this)

enough-australia-5601

09/16/2024, 11:41 AM

The managed chart (

<http://ManagedChart.management.cattle.io/v3|ManagedChart.management.cattle.io/v3>

) named

harvester

in the

fleet-local

namespace is not ready:

Copy code

status:
    conditions:
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      message: Modified(1) [Cluster fleet-local/local]; <http://storageclass.storage.k8s.io|storageclass.storage.k8s.io>
        harvester-longhorn missing
      status: "False"
      type: Ready
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      status: "True"
      type: Processed
    - lastUpdateTime: "2024-09-15T15:36:20Z"
      status: "True"
      type: Defined

Looks like it expects there to be a storage class named

harvester-longhorn

(which is there by default, I think), but in your setup that storage class has been removed. Perhaps you can move the upgrade along by just adding such a storage class?

🙌 1

worried-state-78253

09/16/2024, 11:52 AM

i'll try that tonight - thanks for the spot!

👍 1

worried-state-78253

09/16/2024, 4:16 PM

ok - no joy as of yet... new bundle attached

supportbundle_4d170f3b-38c0-41ba-b05c-dfef3d4a621a_2024-09-16T16-13-14Z.zip

worried-state-78253

09/16/2024, 4:16 PM

I'm just looking through the troubleshooting article, but since the process doesn't even begin I'm a bit stumped.

worried-state-78253

09/16/2024, 7:36 PM

just added that storage class to the cluster “local” which I think is the cluster harvester provisions to manage itself. Its also added to the harvester list of storage classes, I removed it originally from the harvester storage classes because it lacked specificity over what drives the volumes would be created. Anyway despite this the upgrade is still not initiating. Just shutdown all vms and tried again same result. Fresh support bundle attached - this should show the storage present.

supportbundle_4d170f3b-38c0-41ba-b05c-dfef3d4a621a_2024-09-16T19-33-25Z.zip

enough-australia-5601

09/17/2024, 8:03 AM

Hey Craig, seems like something went wrong with adding that storage class. The managed helm chart is still showing the same status, even in the new support bundle:

Copy code

status:
    conditions:
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      message: Modified(1) [Cluster fleet-local/local]; <http://storageclass.storage.k8s.io|storageclass.storage.k8s.io>
        harvester-longhorn missing
      status: "False"
      type: Ready
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      status: "True"
      type: Processed
    - lastUpdateTime: "2024-09-16T19:31:20Z"
      status: "True"
      type: Defined

The cluster

local

in Rancher is the cluster that Rancher runs on. This isn't the same cluster as Harvester. You can add the storage class from the Harvester UI directly. If you access Harvester through Rancher, you can navigate to the Harvester UI by

Virtualization Management

Harvester Clusters

select your Harvester cluster

worried-state-78253

09/17/2024, 8:50 AM

Humm - harvester storage classes include the harvester-longhorn -

Copy code

allowVolumeExpansion: true
apiVersion: <http://storage.k8s.io/v1|storage.k8s.io/v1>
kind: StorageClass
metadata:
  annotations:
    <http://field.cattle.io/description|field.cattle.io/description>: origional default storage
  creationTimestamp: '2024-09-16T16:01:11Z'
  managedFields:
    - apiVersion: <http://storage.k8s.io/v1|storage.k8s.io/v1>
      fieldsType: FieldsV1
      fieldsV1:
        f:allowVolumeExpansion: {}
        f:metadata:
          f:annotations:
            .: {}
            f:<http://field.cattle.io/description|field.cattle.io/description>: {}
        f:parameters:
          .: {}
          f:diskSelector: {}
          f:migratable: {}
          f:numberOfReplicas: {}
          f:staleReplicaTimeout: {}
        f:provisioner: {}
        f:reclaimPolicy: {}
        f:volumeBindingMode: {}
      manager: harvester
      operation: Update
      time: '2024-09-16T16:01:11Z'
  name: harvester-longhorn
  resourceVersion: '52841563'
  uid: 1c9f79e3-9a75-44d6-bfca-eba4b7ad9d9b
parameters:
  diskSelector: hdd
  migratable: 'true'
  numberOfReplicas: '3'
  staleReplicaTimeout: '30'
provisioner: <http://driver.longhorn.io|driver.longhorn.io>
reclaimPolicy: Delete
volumeBindingMode: Immediate

worried-state-78253

09/17/2024, 8:51 AM

However i have a k8s cluster provisioned in harvester that doesnt have this storage class listed -

worried-state-78253

09/17/2024, 8:53 AM

So thats the storage classes listed for that cluster - web-engine-1 is a harvester cluster, web-engine-c1 is a k8s deployment within it

glamorous-sunset-66832

09/17/2024, 12:13 PM

I had the same issue, but got instructions by @clean-cpu-90380 https://github.com/harvester/harvester/issues/6561#issuecomment-2351938832 You have to fetch the manifest and apply it so that you get the correct annotations.

worried-state-78253

09/17/2024, 1:00 PM

this sounds like its on the money - will try this later this evening and let you know the results!

✅ 1

worried-state-78253

09/17/2024, 1:01 PM

think its just a case of updating the storage class hash in the annotations, then it can crack on again! Need to wait now until cluster is not in use...

worried-state-78253

09/25/2024, 9:57 AM

yep - fixes - had to remove my manually created storage class first, then clear the default and apply the chart and bingo - its happy - many thanks!

worried-state-78253

09/25/2024, 9:58 AM

well - its downloading now having passed initial validation...

worried-state-78253

09/25/2024, 9:59 AM

hopefully this will move on soon 🙂

worried-state-78253

09/25/2024, 10:05 AM

its progressing so happy days 🙂

worried-state-78253

09/25/2024, 11:05 AM

very impressed so far - its live migrating vm's and updating all nodes, I've updated the gitlab issue linked to note it worked for us too - many thanks @glamorous-sunset-66832

157 Views

Open in Slack

Previous Next