# harvester
w
I think the hint here is that n0 isn't showing as control-plane because it was set up as a witness node on install; this, I think, points toward an issue with that machine.
Looking in Lens at the pods on the Harvester cluster I can also see one or two unhealthy pods on n0, further pointing to issues with this machine. In the logs, the support package pod (supportbundle-manager-bundle-XXXX) looks to hang when it gets to n0:
```
time="2024-09-15T15:28:42Z" level=debug msg="Expecting bundles from nodes: map[n0: n1: n2: n3: n4: n5:]"
time="2024-09-15T15:29:13Z" level=debug msg="Handle create node bundle for n2"
time="2024-09-15T15:29:13Z" level=debug msg="Complete node n2"
time="2024-09-15T15:29:14Z" level=debug msg="Handle create node bundle for n3"
time="2024-09-15T15:29:14Z" level=debug msg="Complete node n3"
time="2024-09-15T15:29:15Z" level=debug msg="Handle create node bundle for n1"
time="2024-09-15T15:29:15Z" level=debug msg="Complete node n1"
time="2024-09-15T15:29:17Z" level=debug msg="Handle create node bundle for n4"
time="2024-09-15T15:29:17Z" level=debug msg="Complete node n4"
time="2024-09-15T15:29:21Z" level=debug msg="Handle create node bundle for n5"
time="2024-09-15T15:29:21Z" level=debug msg="Complete node n5"
If we compare the logs from n0's supportbundle-agent-bundle-XXX with the others, n0 stops here:
```
+ curl -v -i -H 'Content-Type: application/zip' --data-binary @node_bundle.zip http://10.52.10.15:8080/nodes/n0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.52.10.15:8080...
```
whereas the others have gone further:
```
+ curl -v -i -H 'Content-Type: application/zip' --data-binary @node_bundle.zip http://10.52.10.15:8080/nodes/n1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.52.10.15:8080...
* Connected to 10.52.10.15 (10.52.10.15) port 8080 (#0)
> POST /nodes/n1 HTTP/1.1
> Host: 10.52.10.15:8080
> User-Agent: curl/8.0.1
> Accept: */*
> Content-Type: application/zip
> Content-Length: 1372048
> Expect: 100-continue
> 
< HTTP/1.1 100 Continue
} [65536 bytes data]
* We are completely uploaded and fine
< HTTP/1.1 201 Created
< Date: Sun, 15 Sep 2024 15:29:15 GMT
< Content-Length: 0
< 

1339k    0     0  100 1339k      0   320M --:--:-- --:--:-- --:--:--  327M
* Connection #0 to host 10.52.10.15 left intact
HTTP/1.1 100 Continue

HTTP/1.1 201 Created
Date: Sun, 15 Sep 2024 15:29:15 GMT
Content-Length: 0

+ sleep infinity
```
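A rough way to double-check the same picture from the CLI instead of Lens might look like this; the node name n0 comes from this cluster, everything else is generic kubectl, so treat it as a sketch rather than anything from the support bundle itself:
```
# Which nodes currently carry the control-plane role (a witness node normally shows only etcd)
kubectl get nodes -o wide

# Any pods scheduled on n0 that are not Running or Completed
kubectl get pods -A --field-selector spec.nodeName=n0 | grep -vE 'Running|Completed'

# State of the support bundle manager/agent pods themselves
kubectl get pods -A -o wide | grep supportbundle
```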
I’m going to remove n0 completely from the cluster and try again!
OK - I waited until all pods were cleared off n0 and it looked fully out of the cluster... hitting upgrade produced the same result, with the error in the admission hook: "admission webhook "validator.harvesterhci.io" denied the request: managed chart harvester is not ready, please wait for it to be ready". Support bundle attached.
```
n1:/ # kubectl get bundle -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             1/1                       
fleet-local   local-managed-system-agent                    1/1                       
fleet-local   mcc-harvester                                 0/1                       Modified(1) [Cluster fleet-local/local]; storageclass.storage.k8s.io harvester-longhorn missing
fleet-local   mcc-harvester-crd                             1/1                       
fleet-local   mcc-local-managed-system-upgrade-controller   1/1                       
fleet-local   mcc-rancher-logging-crd                       1/1                       
fleet-local   mcc-rancher-monitoring-crd                    1/1
```
Not sure if the above has any significance - "Modified(1) [Cluster fleet-local/local]; storageclass.storage.k8s.io harvester-longhorn missing" - this was run on a control plane node.
Guess I need to dig into the specifics behind "admission webhook "validator.harvesterhci.io" denied the request: managed chart harvester is not ready, please wait for it to be ready" in the first place... will deep dive again tonight.
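Presumably the place to look is the harvester managed chart itself and its fleet bundle; something along these lines should dump the condition the webhook is checking (a sketch, assuming kubectl access on a control-plane node):
```
# The webhook complains about the "harvester" managed chart; dump its status conditions
kubectl get managedchart harvester -n fleet-local -o yaml

# The backing fleet bundle (mcc-harvester, as listed above) reports the same Ready/Modified state
kubectl get bundle mcc-harvester -n fleet-local -o yaml
```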
e
Hi Craig, I'm looking at your support bundle and I found a few of these error messages:
```
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-12T13:11:05.801669145Z E0912 13:11:05.801609       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-12T13:11:20.777705961Z E0912 13:11:20.777630       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-12T13:11:35.817668323Z E0912 13:11:35.817610       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
[...]
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-15T14:52:20.809677447Z E0915 14:52:20.809617       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-15T14:52:35.789668865Z E0915 14:52:35.789605       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-15T14:52:50.765663858Z E0915 14:52:50.765595       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
```
Seems like your node n0 somehow disappeared at some point on September 12th? Could you please describe the underlying hardware of your setup, and how the history with n0 as a witness node played out? I gathered from your earlier messages that you had a two-node setup with n0 as the witness node. Did you first try the upgrade or first add the other nodes?
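A quick sanity check that this really is a host/connectivity problem rather than a metrics-server problem could look like this; 192.168.122.134 is the address from the log above, the rest is generic:
```
# Does any node still claim 192.168.122.134 as its internal address?
kubectl get nodes -o wide | grep 192.168.122.134

# Basic reachability of the host and its kubelet port
ping -c 3 192.168.122.134
curl -kv --max-time 5 https://192.168.122.134:10250/healthz   # a 401/403 is fine here, "no route to host" is not
```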
w
n0 is the node we've removed from the cluster - it was the witness node used initially. I think that machine wasn't great, and it clearly got demoted from the control plane once 3 other machines were acting as control-plane nodes. Is there a correct way to clean up an old node other than just deleting it from the UI?
I'll try reinstalling the metrics perhaps... will do that now, as maybe that will sort that specific issue out.
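The generic Kubernetes way to retire a node before deleting it is roughly the following; Harvester has its own host-removal flow in the docs, so this is only a sketch, not the official procedure:
```
# Drain the node so workloads reschedule elsewhere
kubectl cordon n0
kubectl drain n0 --ignore-daemonsets --delete-emptydir-data

# Remove the node object once it is empty
kubectl delete node n0

# Afterwards, confirm nothing still references it
kubectl get pods -A --field-selector spec.nodeName=n0
```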
e
I figured that, but I'm still trying to wrap my head around the order in which things happened. n0 was a witness node. I assume n4 and n5 were the two control planes at that point. Then you added more nodes and it seems n1 got promoted to control plane as well. Was this intended or an accident?
w
yes - basically n4/n5 were taken from a previous 1.2 cluster, and we added the witness node so we could run the latest version alongside the old cluster, which had failed to upgrade. Once all the workloads were moved over we reset and upgraded the old hardware and pulled the other nodes across. We also updated the network so they all have a 10G SFP+ link for management and one dedicated for storage, to help improve the stability of big file operations.
e
So at one point you had two entirely separate v1.3.1 clusters and tried to join them into one cluster? I'm quite sure this operation isn't supported at all. Or did I misunderstand that?
w
no - you misunderstand - we had a 1.2 cluster and a 1.3 cluster. We decommissioned the 1.2 cluster and moved the workloads over by rebuilding the VMs on the new cluster and syncing, as that 1.2 cluster was broken and we ran out of patience; we had enough machines to run 2 Harvester clusters at the same time if we used a witness node with the new cluster. Disabling metrics still fails to pass the admission webhook btw. I guess I need to look at what it's checking for admission... not sure where to look atm.
(We also have a dedicated Rancher cluster of low-powered machines set up in HA for management, and registered the Harvester setups to this.)
e
The managed chart (ManagedChart.management.cattle.io/v3) named harvester in the fleet-local namespace is not ready:
```
status:
    conditions:
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      message: Modified(1) [Cluster fleet-local/local]; storageclass.storage.k8s.io
        harvester-longhorn missing
      status: "False"
      type: Ready
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      status: "True"
      type: Processed
    - lastUpdateTime: "2024-09-15T15:36:20Z"
      status: "True"
      type: Defined
```
Looks like it expects there to be a storage class named harvester-longhorn (which is there by default, I think), but in your setup that storage class has been removed. Perhaps you can move the upgrade along by just adding such a storage class?
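A minimal sketch of such a class, applied against the Harvester cluster itself, might look like this; the parameters mirror Longhorn's defaults, the exact defaults on a given install may differ, and the annotations fleet adds to the original class are not reproduced here:
```
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: harvester-longhorn
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  migratable: "true"
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
EOF
```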
w
I'll try that tonight - thanks for the spot!
I'm just looking through the troubleshooting article, but since the process doesn't even begin I'm a bit stumped.
Just added that storage class to the cluster "local", which I think is the cluster Harvester provisions to manage itself. It's also added to the Harvester list of storage classes; I removed it originally from the Harvester storage classes because it lacked specificity over which drives the volumes would be created on. Anyway, despite this the upgrade is still not initiating. Just shut down all VMs and tried again - same result. Fresh support bundle attached - this should show the storage class present.
e
Hey Craig, seems like something went wrong with adding that storage class. The managed helm chart is still showing the same status, even in the new support bundle:
```
status:
    conditions:
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      message: Modified(1) [Cluster fleet-local/local]; storageclass.storage.k8s.io
        harvester-longhorn missing
      status: "False"
      type: Ready
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      status: "True"
      type: Processed
    - lastUpdateTime: "2024-09-16T19:31:20Z"
      status: "True"
      type: Defined
```
The cluster local in Rancher is the cluster that Rancher runs on. This isn't the same cluster as Harvester. You can add the storage class from the Harvester UI directly. If you access Harvester through Rancher, you can navigate to the Harvester UI via Virtualization Management -> Harvester Clusters -> select your Harvester cluster.
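To double-check which cluster a kubectl session is actually pointed at, and whether the class landed there, something like this works (run either with the Rancher-provided kubeconfig for the Harvester cluster or directly on a Harvester node; this is a sketch, not a required step):
```
# Confirm which cluster this kubeconfig actually targets
kubectl config current-context
kubectl get nodes -o wide    # should list n1-n5, not the Rancher management nodes

# Check whether the class exists on *this* cluster
kubectl get storageclass harvester-longhorn -o yaml
```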
w
Hmm - the Harvester storage classes do include harvester-longhorn:
```
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    field.cattle.io/description: origional default storage
  creationTimestamp: '2024-09-16T16:01:11Z'
  managedFields:
    - apiVersion: storage.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:allowVolumeExpansion: {}
        f:metadata:
          f:annotations:
            .: {}
            f:field.cattle.io/description: {}
        f:parameters:
          .: {}
          f:diskSelector: {}
          f:migratable: {}
          f:numberOfReplicas: {}
          f:staleReplicaTimeout: {}
        f:provisioner: {}
        f:reclaimPolicy: {}
        f:volumeBindingMode: {}
      manager: harvester
      operation: Update
      time: '2024-09-16T16:01:11Z'
  name: harvester-longhorn
  resourceVersion: '52841563'
  uid: 1c9f79e3-9a75-44d6-bfca-eba4b7ad9d9b
parameters:
  diskSelector: hdd
  migratable: 'true'
  numberOfReplicas: '3'
  staleReplicaTimeout: '30'
provisioner: driver.longhorn.io
reclaimPolicy: Delete
volumeBindingMode: Immediate
```
However, I have a k8s cluster provisioned in Harvester that doesn't have this storage class listed.
So those are the storage classes listed for that cluster - web-engine-1 is a Harvester cluster, web-engine-c1 is a k8s deployment within it.
g
I had the same issue, but got instructions from @clean-cpu-90380: https://github.com/harvester/harvester/issues/6561#issuecomment-2351938832 - you have to fetch the manifest and apply it so that you get the correct annotations.
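The gist seems to be that the class has to carry the annotations fleet expects on objects it manages; a rough way to compare what the hand-made class has against what the bundle wants might be the following (the objectset.rio.cattle.io annotation names are an assumption based on how fleet/wrangler marks its objects, and the bundle layout varies by version):
```
# Annotations currently on the hand-made class (run against the Harvester cluster)
kubectl get storageclass harvester-longhorn -o jsonpath='{.metadata.annotations}'; echo

# What fleet expects lives in the mcc-harvester bundle; its resources may be embedded
# as plain YAML or compressed depending on version, so this is only a starting point
# for locating the StorageClass definition and its objectset.rio.cattle.io/* annotations
kubectl get bundle mcc-harvester -n fleet-local -o yaml | less
```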
w
This sounds like it's on the money - will try this later this evening and let you know the results!
Think it's just a case of updating the storage class hash in the annotations, then it can crack on again! Need to wait now until the cluster is not in use...
Yep - fixed - had to remove my manually created storage class first, then clear the default and apply the chart, and bingo - it's happy - many thanks!
Well - it's downloading now, having passed initial validation...
hopefully this will move on soon 🙂
It's progressing, so happy days 🙂
Very impressed so far - it's live migrating VMs and updating all nodes. I've updated the GitHub issue linked above to note it worked for us too - many thanks @glamorous-sunset-66832