adamant-kite-43734
09/15/2024, 3:20 PM
worried-state-78253
09/15/2024, 3:31 PM
worried-state-78253
09/15/2024, 3:38 PM
time="2024-09-15T15:28:42Z" level=debug msg="Expecting bundles from nodes: map[n0: n1: n2: n3: n4: n5:]"
time="2024-09-15T15:29:13Z" level=debug msg="Handle create node bundle for n2"
time="2024-09-15T15:29:13Z" level=debug msg="Complete node n2"
time="2024-09-15T15:29:14Z" level=debug msg="Handle create node bundle for n3"
time="2024-09-15T15:29:14Z" level=debug msg="Complete node n3"
time="2024-09-15T15:29:15Z" level=debug msg="Handle create node bundle for n1"
time="2024-09-15T15:29:15Z" level=debug msg="Complete node n1"
time="2024-09-15T15:29:17Z" level=debug msg="Handle create node bundle for n4"
time="2024-09-15T15:29:17Z" level=debug msg="Complete node n4"
time="2024-09-15T15:29:21Z" level=debug msg="Handle create node bundle for n5"
time="2024-09-15T15:29:21Z" level=debug msg="Complete node n5"
If we compare the logs from n0's supportbundle-agent-bundle-XXX vs the others -
n0 stops here -
+ curl -v -i -H 'Content-Type: application/zip' --data-binary @node_bundle.zip http://10.52.10.15:8080/nodes/n0
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 10.52.10.15:8080...
whereas the others have gone further -
+ curl -v -i -H 'Content-Type: application/zip' --data-binary @node_bundle.zip http://10.52.10.15:8080/nodes/n1
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 10.52.10.15:8080...
* Connected to 10.52.10.15 (10.52.10.15) port 8080 (#0)
> POST /nodes/n1 HTTP/1.1
> Host: 10.52.10.15:8080
> User-Agent: curl/8.0.1
> Accept: */*
> Content-Type: application/zip
> Content-Length: 1372048
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
} [65536 bytes data]
* We are completely uploaded and fine
< HTTP/1.1 201 Created
< Date: Sun, 15 Sep 2024 15:29:15 GMT
< Content-Length: 0
<
1339k 0 0 100 1339k 0 320M --:--:-- --:--:-- --:--:-- 327M
* Connection #0 to host 10.52.10.15 left intact
HTTP/1.1 100 Continue
HTTP/1.1 201 Created
Date: Sun, 15 Sep 2024 15:29:15 GMT
Content-Length: 0
+ sleep infinity
I’m going to remove n0 completely from the cluster and try again!
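A quick check that might help before that, sketched under the assumption that 10.52.10.15:8080 in the curl calls above is the support bundle manager's address: run a bounded request from n0's shell to see whether the TCP path itself is the problem.
# run on n0; --connect-timeout keeps curl from hanging the way the agent did
curl -v --connect-timeout 5 http://10.52.10.15:8080/
# if netcat happens to be installed on the node, the TCP handshake alone can be tested too
nc -vz -w 5 10.52.10.15 8080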
worried-state-78253
09/15/2024, 3:44 PM
worried-state-78253
09/16/2024, 9:22 AM
n1:/ # kubectl get bundle -A
NAMESPACE NAME BUNDLEDEPLOYMENTS-READY STATUS
fleet-local fleet-agent-local 1/1
fleet-local local-managed-system-agent 1/1
fleet-local mcc-harvester 0/1 Modified(1) [Cluster fleet-local/local]; storageclass.storage.k8s.io harvester-longhorn missing
fleet-local mcc-harvester-crd 1/1
fleet-local mcc-local-managed-system-upgrade-controller 1/1
fleet-local mcc-rancher-logging-crd 1/1
fleet-local mcc-rancher-monitoring-crd 1/1
Not sure if the above has any significance: "Modified(1) [Cluster fleet-local/local]; storageclass.storage.k8s.io harvester-longhorn missing". This was run on a control plane node.
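The full condition behind that Modified(1) message can be read straight from the Fleet bundle; a minimal sketch, assuming the resource and namespace names shown in the output above:
# run on a control plane node; prints the complete status conditions for the chart
kubectl -n fleet-local get bundle mcc-harvester -o yaml
# the owning ManagedChart (named harvester, as comes up later in the thread) can be checked the same way
kubectl -n fleet-local get managedchart harvester -o yaml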
worried-state-78253
09/16/2024, 9:24 AM
enough-australia-5601
09/16/2024, 11:00 AM
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-12T13:11:05.801669145Z E0912 13:11:05.801609 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-12T13:11:20.777705961Z E0912 13:11:20.777630 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-12T13:11:35.817668323Z E0912 13:11:35.817610 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
[...]
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-15T14:52:20.809677447Z E0915 14:52:20.809617 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-15T14:52:35.789668865Z E0915 14:52:35.789605 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
logs/kube-system/rke2-metrics-server-7f745dbddf-crckz/metrics-server.log:2024-09-15T14:52:50.765663858Z E0915 14:52:50.765595 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.122.134:10250/metrics/resource\": dial tcp 192.168.122.134:10250: connect: no route to host" node="n0"
Seems like your node n0 has somehow disappeared at some point on September 12th?
Could you please describe the underlying hardware of your setup, and how the history with n0 as a witness node played out?
I gathered from your earlier messages that you had a two-node setup with n0 as the witness node. Did you then first try the upgrade, or first add the other nodes?
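A small hedged check for the n0 question: whether the node is still registered at all, and whether 192.168.122.134, the address metrics-server keeps failing to reach, is still assigned to it.
# is n0 still listed, and with which internal IP?
kubectl get nodes -o wide
# is the kubelet address from the metrics-server errors reachable at all?
ping -c 3 192.168.122.134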
worried-state-78253
09/16/2024, 11:11 AM
worried-state-78253
09/16/2024, 11:13 AM
enough-australia-5601
09/16/2024, 11:15 AM
worried-state-78253
09/16/2024, 11:18 AM
enough-australia-5601
09/16/2024, 11:29 AM
worried-state-78253
09/16/2024, 11:32 AM
worried-state-78253
09/16/2024, 11:33 AM
enough-australia-5601
09/16/2024, 11:41 AM
The ManagedChart (ManagedChart.management.cattle.io/v3) named harvester in the fleet-local namespace is not ready:
status:
  conditions:
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      message: Modified(1) [Cluster fleet-local/local]; storageclass.storage.k8s.io
        harvester-longhorn missing
      status: "False"
      type: Ready
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      status: "True"
      type: Processed
    - lastUpdateTime: "2024-09-15T15:36:20Z"
      status: "True"
      type: Defined
Looks like it expects there to be a storage class named harvester-longhorn
(which is there by default, I think), but in your setup that storage class has been removed.
Perhaps you can move the upgrade along by just adding such a storage class?
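A minimal sketch of such a StorageClass, assuming stock Longhorn defaults; the parameter values are illustrative and may need adjusting for this environment:
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: harvester-longhorn
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  migratable: "true"
EOF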
worried-state-78253
09/16/2024, 11:52 AM
worried-state-78253
09/16/2024, 4:16 PM
worried-state-78253
09/16/2024, 4:16 PM
worried-state-78253
09/16/2024, 7:36 PM
enough-australia-5601
09/17/2024, 8:03 AM
status:
  conditions:
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      message: Modified(1) [Cluster fleet-local/local]; storageclass.storage.k8s.io
        harvester-longhorn missing
      status: "False"
      type: Ready
    - lastUpdateTime: "2024-09-05T01:42:17Z"
      status: "True"
      type: Processed
    - lastUpdateTime: "2024-09-16T19:31:20Z"
      status: "True"
      type: Defined
The cluster local in Rancher is the cluster that Rancher runs on. This isn't the same cluster as Harvester. You can add the storage class from the Harvester UI directly. If you access Harvester through Rancher, you can navigate to the Harvester UI by Virtualization Management -> Harvester Clusters -> select your Harvester cluster.
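A small hedged check to make sure the StorageClass lands on the Harvester cluster rather than Rancher's local cluster (the context name depends on your kubeconfig):
# confirm which cluster the current kubeconfig context points at
kubectl config current-context
# after adding it, the storage class should be listed on the Harvester cluster
kubectl get storageclass harvester-longhorn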
worried-state-78253
09/17/2024, 8:50 AM
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    field.cattle.io/description: origional default storage
  creationTimestamp: '2024-09-16T16:01:11Z'
  managedFields:
    - apiVersion: storage.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:allowVolumeExpansion: {}
        f:metadata:
          f:annotations:
            .: {}
            f:field.cattle.io/description: {}
        f:parameters:
          .: {}
          f:diskSelector: {}
          f:migratable: {}
          f:numberOfReplicas: {}
          f:staleReplicaTimeout: {}
        f:provisioner: {}
        f:reclaimPolicy: {}
        f:volumeBindingMode: {}
      manager: harvester
      operation: Update
      time: '2024-09-16T16:01:11Z'
  name: harvester-longhorn
  resourceVersion: '52841563'
  uid: 1c9f79e3-9a75-44d6-bfca-eba4b7ad9d9b
parameters:
  diskSelector: hdd
  migratable: 'true'
  numberOfReplicas: '3'
  staleReplicaTimeout: '30'
provisioner: driver.longhorn.io
reclaimPolicy: Delete
volumeBindingMode: Immediate
worried-state-78253
09/17/2024, 8:51 AM
worried-state-78253
09/17/2024, 8:53 AM
glamorous-sunset-66832
09/17/2024, 12:13 PM
worried-state-78253
09/17/2024, 1:00 PM
worried-state-78253
09/17/2024, 1:01 PM
worried-state-78253
09/25/2024, 9:57 AM
worried-state-78253
09/25/2024, 9:58 AM
worried-state-78253
09/25/2024, 9:59 AM
worried-state-78253
09/25/2024, 10:05 AM
worried-state-78253
09/25/2024, 11:05 AM