# harvester
w
You can download the version.yaml file and manually apply it; then the upgrade button should show up.
then 🤞 😉
s
Ah yes, thanks.
Copy code
curl https://releases.rancher.com/harvester/v1.2.0/version.yaml | kubectl apply --context=harvester-cluster -f -
Now I have everything crossed 🙂
Harvester upgrade from 1.1.2 to 1.2.0 has got stuck 50% of the way through the "Upgrading System Service" phase, after downloading everything and preloading the images onto the three nodes. Following the advice on the upgrade notes page, I found the hvst-upgrade apply-manifests job is spewing out this message every 5 seconds.
Copy code
$ kubectl --context harvester003 -n harvester-system logs hvst-upgrade-6hp8q-apply-manifests-9j9m6 --tail=10
instance-manager-r pod count is not 1 on node harvester001, will retry...
instance-manager-r pod count is not 1 on node harvester001, will retry...
instance-manager-r pod count is not 1 on node harvester001, will retry...
instance-manager-r pod count is not 1 on node harvester001, will retry...
And it's true - there are two `instance-manager-r` pods on that node - one 11 hours old running `longhorn-instance-manager:v1.4.3` and the other 12 days old running `longhorn-instance-manager:v1_20221003`. I suppose I could delete the old one - but would like a little bit of confidence that this would be the correct remedy.
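(For reference, a quick way to see the two pods side by side - the label selector below is the one Longhorn normally applies to its instance-manager pods, so treat it as an assumption:)

```shell
# List Longhorn instance-manager pods with their node and image, to spot
# the stale one. Assumes the usual longhorn.io/component label is present.
kubectl -n longhorn-system get pods \
  -l longhorn.io/component=instance-manager \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,IMAGE:.spec.containers[0].image
```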
s
Hi @sticky-summer-13450, could you generate the support bundle for it? Generally, if you confirm that the volumes are all healthy (replicas should be more than 2), you can delete the PDB directly. Or you can attach the support bundle and I can double-check for you.
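(Roughly what that PDB check and removal would look like - a sketch only, and the PDB name here is an assumption matched to the stale instance manager:)

```shell
# Longhorn creates a PodDisruptionBudget per instance manager; list them,
# then delete the one tied to the stale pod once all volumes are healthy.
kubectl -n longhorn-system get pdb
# kubectl -n longhorn-system delete pdb instance-manager-r-1503169c   # name assumed
```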
s
There's nothing in Longhorn that is degraded or failed, but here's the support bundle:
s
Thanks, let me check the SB…
Hi @sticky-summer-13450, could you help open an issue for further analysis? Also attach the above support bundle, thanks! I checked the old im-r. The replica instances of this im-r have all been deleted, as the following check shows:
Copy code
$ kubectl get instancemanager instance-manager-r-1503169c -n longhorn-system -o yaml |yq -e ".status.instances" |grep name: > replica-list.txt
$ cat replica-list.txt |awk '{print $2}' |xargs -I {} kubectl get replicas {} -n longhorn-system
Error from server (NotFound): replicas.longhorn.io "pvc-0ca5a4f3-d641-4b31-b33d-96b925d9af04-r-b0367b94" not found
Error from server (NotFound): replicas.longhorn.io "pvc-0ca5a4f3-d641-4b31-b33d-96b925d9af04-r-e861fab9" not found
Error from server (NotFound): replicas.longhorn.io "pvc-3f9a22e4-df30-45fc-b4c7-baed0c4ff217-r-894e8723" not found
Error from server (NotFound): replicas.longhorn.io "pvc-3f9a22e4-df30-45fc-b4c7-baed0c4ff217-r-29032c1b" not found
Error from server (NotFound): replicas.longhorn.io "pvc-3f9a22e4-df30-45fc-b4c7-baed0c4ff217-r-b6ada661" not found
Error from server (NotFound): replicas.longhorn.io "pvc-3f9a22e4-df30-45fc-b4c7-baed0c4ff217-r-cb42c033" not found
Error from server (NotFound): replicas.longhorn.io "pvc-3f9a22e4-df30-45fc-b4c7-baed0c4ff217-r-eb83b435" not found
Error from server (NotFound): replicas.longhorn.io "pvc-67e7a314-7384-469f-9268-bdcd8728e526-r-eaeeb01b" not found
Error from server (NotFound): replicas.longhorn.io "pvc-67e7a314-7384-469f-9268-bdcd8728e526-r-f4fcc792" not found
Error from server (NotFound): replicas.longhorn.io "pvc-160d8d70-01d1-4a13-abd5-11cff2be6071-r-a2afb620" not found
Error from server (NotFound): replicas.longhorn.io "pvc-160d8d70-01d1-4a13-abd5-11cff2be6071-r-e056af48" not found
Error from server (NotFound): replicas.longhorn.io "pvc-a5b5fe4c-eca4-4c97-a3db-f9490980c044-r-a8d11e24" not found
Error from server (NotFound): replicas.longhorn.io "pvc-a5b5fe4c-eca4-4c97-a3db-f9490980c044-r-d66c99b6" not found
Error from server (NotFound): replicas.longhorn.io "pvc-a5b5fe4c-eca4-4c97-a3db-f9490980c044-r-eab71d39" not found
Error from server (NotFound): replicas.longhorn.io "pvc-b5885e18-cc31-4ee1-8c91-afe881e09930-r-df556704" not found
Error from server (NotFound): replicas.longhorn.io "pvc-c4dfa684-2e3a-496f-9396-0e137a8f85e7-r-f1dd92d2" not found
Error from server (NotFound): replicas.longhorn.io "pvc-c108c3a1-bf5c-4d93-bb2b-99f1db4cc11c-r-1e8b6fa3" not found
Error from server (NotFound): replicas.longhorn.io "pvc-d01443ee-14fa-42c0-8721-b08935d5eaae-r-36c82c20" not found
Error from server (NotFound): replicas.longhorn.io "pvc-d01443ee-14fa-42c0-8721-b08935d5eaae-r-79da0593" not found
But somehow, they all still exist in the `instancemanager` status. That's why this im-r could not be deleted. I checked all the attached volumes, and it looks like they are all healthy. So you can directly remove this im-r to let the upgrade continue.
s
Reported in https://github.com/harvester/harvester/issues/4517. I'll go and remove that older `instance-manager-r` pod.
s
Thanks, feel free to update here with your upgrade progress.
s
will do - I'm currently waiting for something to happen after deleting the pod.
I have deleted that instance-manager-r-1503169c pod:
Copy code
$ kubectl delete pod instance-manager-r-1503169c --context harvester003 -n longhorn-system
pod "instance-manager-r-1503169c" deleted
the hvst-upgrade apply-manifests job has moved on:
Copy code
2023-09-12T12:28:34+01:00 instance-manager-r pod count is not 1 on node harvester001, will retry...
2023-09-12T12:28:39+01:00 instance-manager-r pod count is not 1 on node harvester001, will retry...
2023-09-12T12:28:45+01:00 instance-manager-r pod image is not longhornio/longhorn-instance-manager:v1.4.3, will retry...
2023-09-12T12:28:50+01:00 Checking instance-manager-r pod on node harvester001 OK.
2023-09-12T12:28:50+01:00 Checking instance-manager-r pod on node harvester002...
2023-09-12T12:28:51+01:00 Checking instance-manager-r pod on node harvester002 OK.
2023-09-12T12:28:51+01:00 Checking instance-manager-r pod on node harvester003...
2023-09-12T12:28:51+01:00 Checking instance-manager-r pod on node harvester003 OK.
2023-09-12T12:28:51+01:00 Upgrading Managedchart rancher-monitoring-crd to 102.0.0+up40.1.2
2023-09-12T12:28:54+01:00 managedchart.management.cattle.io/rancher-monitoring-crd patched
2023-09-12T12:28:55+01:00 managedchart.management.cattle.io/rancher-monitoring-crd patched
2023-09-12T12:28:55+01:00 Waiting for ManagedChart fleet-local/rancher-monitoring-crd from generation 15
2023-09-12T12:28:55+01:00 Target version: 102.0.0+up40.1.2, Target state: ready
2023-09-12T12:28:56+01:00 Current version: 102.0.0+up40.1.2, Current state: OutOfSync, Current generation: 15
2023-09-12T12:29:01+01:00 Sleep for 5 seconds to retry
2023-09-12T12:29:02+01:00 Current version: 102.0.0+up40.1.2, Current state: WaitApplied, Current generation: 17
2023-09-12T12:29:07+01:00 Sleep for 5 seconds to retry
2023-09-12T12:29:08+01:00 Current version: 102.0.0+up40.1.2, Current state: WaitApplied, Current generation: 17
2023-09-12T12:29:13+01:00 Sleep for 5 seconds to retry
but appears to be stuck again.
s
cc @ancient-pizza-13099 could you help check this?
p
@sticky-summer-13450 can you download helm and try to get the history of the `rancher-monitoring-crd` chart? Thanks:
Copy code
helm history rancher-monitoring-crd -n cattle-monitoring-system
s
Hi @prehistoric-balloon-31801 - sure:
Copy code
$ helm history rancher-monitoring-crd --kube-context harvester003 -n cattle-monitoring-system
REVISION	UPDATED                 	STATUS         	CHART                                  	APP VERSION	DESCRIPTION      
1174    	Sun Jun 26 07:17:18 2022	superseded     	rancher-monitoring-crd-100.1.0+up19.0.3	           	Upgrade complete 
1175    	Sun Jun 26 17:17:18 2022	superseded     	rancher-monitoring-crd-100.1.0+up19.0.3	           	Upgrade complete 
1176    	Sun Jun 26 17:17:33 2022	superseded     	rancher-monitoring-crd-100.1.0+up19.0.3	           	Upgrade complete 
1177    	Sun Jun 26 17:19:08 2022	superseded     	rancher-monitoring-crd-100.1.0+up19.0.3	           	Upgrade complete 
1178    	Sun Jun 26 17:19:24 2022	superseded     	rancher-monitoring-crd-100.1.0+up19.0.3	           	Upgrade complete 
1179    	Mon Jun 27 03:17:18 2022	superseded     	rancher-monitoring-crd-100.1.0+up19.0.3	           	Upgrade complete 
1180    	Mon Jun 27 03:17:34 2022	superseded     	rancher-monitoring-crd-100.1.0+up19.0.3	           	Upgrade complete 
1181    	Mon Jun 27 03:20:29 2022	superseded     	rancher-monitoring-crd-100.1.0+up19.0.3	           	Upgrade complete 
1182    	Mon Jun 27 03:20:44 2022	deployed       	rancher-monitoring-crd-100.1.0+up19.0.3	           	Upgrade complete 
1183    	Mon Jun 27 04:03:08 2022	pending-upgrade	rancher-monitoring-crd-100.1.0+up19.0.3	           	Preparing upgrade
p
Thank you. Can you get a support bundle? I think the upgrade will continue if we roll back the chart, but I'd like to check why fleet doesn't upgrade it.
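(For the record, such a rollback would look something like the following - a sketch only, not something to run unadvised. Revision 1182 is the last revision with STATUS "deployed" in the history shown earlier; rolling back to it clears the pending-upgrade state so fleet can retry:)

```shell
# Roll rancher-monitoring-crd back to its last successfully deployed
# revision (1182), clearing the stuck pending-upgrade revision.
helm rollback rancher-monitoring-crd 1182 -n cattle-monitoring-system
```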
s
On the ticket, Jain Wang is suggesting it could be because the cluster is given a Let's Encrypt TLS certificate without the IP as a SAN. So I'll follow that suggestion (also in this ticket) and report the results - but I'll also start creating another support bundle.
Latest SupportBundle (before I try to work-around the TLS issue).
p
Thanks Mark! cc @red-king-19196
@sticky-summer-13450 I guess you deleted the "Upgrade" resource, right?
The one that represents the v1.2.0 upgrade.
s
Yes I did - in my attempts to restart the upgrade. Sorry 😞
p
We are the ones who should say sorry for the experience. Let us check; there should be a way to bypass the check. But please note it's near Friday EOB for me and Zespre 🙂
s
Thanks.
r
Hi Mark, sorry for the slow reply. We'd like to bring fleet-agent back on the right track first, to see if that relieves the whole situation and allows us to upgrade the cluster again. From the support bundle, it seems the communication between fleet-agent and the API server has some issues that cause multiple bundle deployments to fall out of sync:
Copy code
W0922 09:01:43.872649       1 reflector.go:442] pkg/mod/github.com/rancher/client-go@v0.24.0-fleet1/tools/cache/reflector.go:167: watch of *v1alpha1.BundleDeployment ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 51; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
Since we don’t collect all of the secret objects on the users’ cluster with support-bundle-kit, could you help us check the content of the `fleet-agent` secret?
Copy code
kubectl -n cattle-fleet-local-system get secret fleet-agent -o jsonpath='{.data.kubeconfig}' | base64 -
There might be an inconsistency in the CA and the URL. Thank you for bearing with us!
s
I assume you mean `... | base64 -d -` to decode the data rather than encode it twice. I think I can share that data - there is a token but I don't know how secure it needs to stay ...
Copy code
$ kubectl -n cattle-fleet-local-system --context=harvester003 get secret fleet-agent -o jsonpath='{.data.kubeconfig}' | base64 -d -
apiVersion: v1
clusters:
- cluster:
    server: https://harvester-cluster.lan.lxiv.uk
  name: cluster
contexts:
- context:
    cluster: cluster
    namespace: cluster-fleet-local-local-1a3d67d0a899
    user: user
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: user
  user:
    token: eyJhbGciOiJSUzI1NiIsImtpZCI6Im1ZeEFtYXppNnVjbzBoV3BxNFE0YmdKWHFQa1c4STVtVE5aeDdUNFplQ0UifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJjbHVzdGVyLWZsZWV0LWxvY2FsLWxvY2FsLTFhM2Q2N2QwYTg5OSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJyZXF1ZXN0LXptNTI0LWIzZWFmY2Q5LTIxNWItNDU1Zi04YjQ3LTFlN2Q1ZDhmNzk0Ny10b2tlbiIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJyZXF1ZXN0LXptNTI0LWIzZWFmY2Q5LTIxNWItNDU1Zi04YjQ3LTFlN2Q1ZDhmNzk0NyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImJmZDlkZjI3LWUxYzctNDUyMC1hMTc1LWY4NzI1OTE1ZmZjZCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpjbHVzdGVyLWZsZWV0LWxvY2FsLWxvY2FsLTFhM2Q2N2QwYTg5OTpyZXF1ZXN0LXptNTI0LWIzZWFmY2Q5LTIxNWItNDU1Zi04YjQ3LTFlN2Q1ZDhmNzk0NyJ9.mwEmXPlhkZbn_g0XCVwBOjO34dw0MkWw0-fLtCanJWZlbPi-cQUdDoMP3kUTqNu6KYfew-kTgA2THyVtpTVtnnYWe1gXnz4GXCqXrCNT7qLHg7zJzV0y4-2eaiM_1hJ9XToLodIMsHq7tNObDvc12fLLm91fnf17KkCuTdfYEbq9DlQi3_h2BEFCZLfN2R5T2VjBqKMJAujqZGlTmLAIYuPe4ITCk5F8dGbWfJyIOySsns9iEd8URQtSz3x44aLL37YhyMDfq-9sDiVTiw0dcG9IF2OZBRdy4vnj7ipYJyTYLxB-m7F4J7y9gS5Xb_sn8_cUS21sQ7idHds_I8Mfkw
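(Side note on the flag mix-up here: with GNU coreutils, plain `base64` encodes and `base64 -d` decodes - a quick sanity check:)

```shell
# GNU coreutils base64: encoding then decoding round-trips the input.
echo -n 'hello' | base64        # -> aGVsbG8=
echo -n 'aGVsbG8=' | base64 -d  # -> hello
```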
r
oops, sorry. yeah need to decode it.
hmm, so there’s no `certificate-authority-data` field under the `cluster` section, just `server`. Normally, it should be something like this:
Copy code
# kubectl -n cattle-fleet-local-system get secret fleet-agent -o jsonpath='{.data.kubeconfig}' | base64 -d
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJlVENDQVIrZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREFqQWtNU0l3SUFZRFZRUUREQmx5YTJVeUxYTmwKY25abGNpMWpZVUF4TmpVMk5qQTVOalUyTUI0WERUSXNRFl6TURFM01qQTFObG9YRFRNeU1EWXlOekUzTWpBMQpObG93SkRFaU1DQUdBMVVFQXd3WmNtdGxNaTF6WlhKMlpYSXRZMkZBTVRZMU5qWXdPVFkxTmpCWk1CTUdCeXFHClNNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJOeHEvdU9NMS8wTlJ2eTlFM0Y0eis3NXZaVXFSNXI3REFTb3hwTWIKYzR6STZuV3VoU2grZVNqTG85SzYzOHN1TE9tZDhhM2tvMTZUT0dOZWNueEJCZE9qUWpCQU1BNEdBMVVkRHdFQgovd1FFQXdJQ3BEQVBCZ05WSFJNQkFmOEVCVEFEQVFIL01CMEdBMVVkRGdRV0JCU1Q1ckE1TXhxZnkwc21kaG1zCjZRbmN0d3RwQpBS0JnZ3Foa2pPUFFRREFnTklBREJGQWlBUTZqWUorQkFwMVh2RnRLQ0llVkVXaEc2akZiZmcKcUQ3U2J4UkQwd2tNVXdJaEFLV0RaVnZuKzF4d3IwRTliTnNqVERlVDdINEFYTGhlbHZxeCtmTTREeDl4Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
    server: https://10.53.0.1:443
  name: cluster
<redacted>
But I remember you’re using a Let’s Encrypt certificate, so the CA should already be there 🤔
s
I don't know if it makes any difference, but since I'm using a full Let's Encrypt certificate I have not needed to push the `CA` into the `ssl-certificates` setting (harvesterhci.io/v1beta1); I have only needed to push the `publicCertificate` (the `fullchain.pem` from LE) and the `privateKey` (the `privkey.pem` from LE). Is that a problem / an assumption somewhere? My browser already trusts the certificate so I haven't needed to push a CA into Harvester. Do the rest of the components in Harvester also trust the certificate?
r
It should be okay not to provide the CA in the `ssl-certificates` setting since it’s a well-known CA. In fact, I’d suggest removing the FQDN from the `server-url` setting, because all of these fleet-agent exchanges are meant to be internal communications from Harvester’s point of view; by design, user-designated domain names and certificates shouldn’t interfere with the internal communication. We’re also reviewing a fix that removes the updating of `server-url` so that it won’t change the value to the VIP address during the Harvester upgrade. In addition to that, as a workaround please change the `apiServerURL` in the `fleet-controller` ConfigMap to point to the internal IP address of the `rancher` Service - I think, in your case, it’s `10.64.0.19`. Also, fill in `apiServerCA` with the value of the `internal-cacerts` setting. Finally, restart the `fleet-controller` deployment after applying the changes.
Copy code
$ kubectl -n cattle-fleet-system get cm fleet-controller -o yaml
apiVersion: v1
data:
  config: |
    {
      "systemDefaultRegistry": "",
      "agentImage": "rancher/fleet-agent:v0.7.0",
      "agentImagePullPolicy": "IfNotPresent",
      "apiServerURL": "https://10.53.138.254", # <-- please update this field with the value from rancher service's ip address
      "apiServerCA": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJ2VENDQVdPZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREFqQkdNUnd3R2dZRFZRUUtFeE5rZVc1aGJXbGoKYkdsemRHVnVaWE
l0YjNKbk1TWXdKQVlEVlFRRERCMWtlVzVoYldsamJHbHpkR1Z1WlhJdFkyRkFNVFk1TlRZeQpNekUxTlRBZUZ3MHlNekE1TWpVd05qSTFOVFZhRncwek16QTVNakl3TmpJMU5UVmFNRVl4SERBYUJnTlZCQW9UCk
UyUjVibUZ0YVdOc2FYTjBaVzVsY2kxdmNtY3hKakFrQmdOVkJBTU1IV1I1Ym1GdGFXTnNhWE4wWlc1bGNpMWoKWVVBeE5qazFOakl6TVRVMU1Ga3dFd1lIS29aSXpqMENBUVlJS29aSXpqMERBUWNEUWdBRU1QVE
lFUDV6TjBISwpVWmtwbkNXN0xwN0JoOC9TRlEwbzU3UFFQNUdzQ0l1RlhhaENpekxKWHpKbkZhRi9qTmpTSEhXUmFkaGV5YXlBCks1TzlERTZVcUtOQ01FQXdEZ1lEVlIwUEFRSC9CQVFEQWdLa01BOEdBMVVkRX
dFQi93UUZNQU1CQWY4d0hRWUQKVlIwT0JCWUVGQmY3cVpMNlhQcEFkUjJSaWd4OUNoOVh5ejJBTUFvR0NDcUdTTTQ5QkFNQ0EwZ0FNRVVDSURGWgp2MzdzaW9wUElwR2tBcjJ2MzR0bDB0Q0g3S0d0cjhEZkttUT
BTNWVnQWlFQTlxSHE4M0RNZlh3YzdObURWd3U3Cm80enJQNUJBUVU2MEpLVFFxR3piRVNVPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==", # <-- please update this field with the value from intenral-cacerts setting
      "agentCheckinInterval": "15m",
      "ignoreClusterRegistrationLabels": false,
s
Okay... I did:
Copy code
$ kubectl -n cattle-fleet-local-system --context=harvester003 get setting.management.cattle.io internal-cacerts -o jsonpath='{.value}' | base64 -w0 -
to get the `internal-cacerts` (took a while to get that to work 😵‍💫). I updated the `fleet-controller` ConfigMap with `apiServerURL` as `https://10.64.0.19` - which is the HA Harvester API (this IP address is available on my LAN, not just internal to Harvester) - and `apiServerCA` as the base64 obtained above. Then I deleted the current `fleet-controller-*` pod. The deployment created a new pod which logged some startup stuff ending with `msg="Cluster import for 'fleet-local/local'. Deployed new agent"`. Um - what now? 🙂
r
Hmmm, it should redeploy the fleet agent with the new URL and CA. What do you see in fleet agent’s log now?
s
Yes, the fleet agent was restarted and is repeatedly logging this, again.
Copy code
time="2023-09-26T07:14:31Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://10.64.0.19/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 10.64.0.19 because it doesn't contain any IP SANs"
r
It must be getting the wrong CA for verifying the internal endpoint… Do you see any `fleet-agent` and `fleet-agent-bootstrap` secrets in your cluster? We could check their content.
s
I was told to update fleet-agent-bootstrap in this thread: https://github.com/harvester/harvester/issues/4517#issuecomment-1729311789
r
The secret might be removed and re-created during fleet-controller restarts. So we might have to check the content again.
s
Ah yes - the `fleet-agent-bootstrap` secret is the same age as the `fleet-agent-*` and `fleet-controller-*` pods. The CA in the `fleet-agent-bootstrap` secret is the same as the value from
$ kubectl -n cattle-fleet-local-system --context=harvester003 get setting.management.cattle.io internal-cacerts -o jsonpath='{.value}'
There is no `fleet-agent` secret.
r
cool, that looks good. have you emptied the `server-url` setting already?
s
Ah - I'd missed that one, no I haven't. I should make that `''` now?
r
yep, `value: ""`
s
done
r
and check if anything happens in fleet-agent’s log
s
still logging the same as before, every minute. It hasn't restarted though, so I guess the pod might need deleting?
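(Recreating the pod can be done by deleting it and letting its owning controller replace it - the label selector here is an assumption about how the fleet-agent pod is labelled:)

```shell
# Delete the fleet-agent pod; its controller in cattle-fleet-local-system
# should spin up a replacement that reads the fresh bootstrap config.
kubectl -n cattle-fleet-local-system delete pod -l app=fleet-agent
```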
r
yeah, we can give it a try
s
same error.
I have a thought - should the `fleet-controller` ConfigMap `apiServerURL` be the same as the `management` `internal-server-url` value? Might that have the internal CA instead of the external certificate I've pushed in?
r
wait.. aren’t they the same value now? should be `https://10.64.0.19`
s
`internal-server-url` is `https://10.53.15.158`, whereas `apiServerURL` is the external IP address on my LAN, which is `https://10.64.0.19`. I think I was clear above that 10.64.0.19 is my LAN IP address, not an IP address internal to K8s in Harvester.
r
ah, my fault, i thought `10.64.0.19` was the internal VIP 🤯 let me think twice. so `10.64.0.19` is the management address (the IP address you filled in during Harvester installation), right?
s
yes - the HA (High Availability) address
r
and `10.53.15.158` is the internal cluster-ip of the `rancher` service object. my bad lol, i mixed them up
s
I assume so!
🙂
r
thanks for the notice!
so we have to change that IP back to `10.53.15.158` and try again!
s
on it 🙂
`fleet-controller` ConfigMap updated, `fleet-controller-*` pod deleted, and the `fleet-agent-*` pod logged this:
Copy code
I0926 08:05:39.425969       1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-local-system/fleet-agent-lock...
I0926 08:05:42.421333       1 leaderelection.go:258] successfully acquired lease cattle-fleet-local-system/fleet-agent-lock
time="2023-09-26T08:05:44Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
time="2023-09-26T08:05:44Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
time="2023-09-26T08:05:44Z" level=info msg="Starting /v1, Kind=Node controller"
time="2023-09-26T08:05:44Z" level=info msg="Starting /v1, Kind=Secret controller"
E0926 08:05:44.512487       1 memcache.go:206] couldn't get resource list for management.cattle.io/v3: 
time="2023-09-26T08:05:44Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=BundleDeployment controller"
time="2023-09-26T08:05:44Z" level=info msg="getting history for release mcc-local-managed-system-upgrade-controller"
time="2023-09-26T08:05:44Z" level=info msg="getting history for release fleet-agent-local"
time="2023-09-26T08:05:44Z" level=info msg="getting history for release local-managed-system-agent"
W0926 08:05:44.797582       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0926 08:05:44.884917       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0926 08:05:46.942805       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0926 08:05:46.953756       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0926 08:05:47.017596       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0926 08:05:47.069599       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0926 08:05:47.109470       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
time="2023-09-26T08:05:49Z" level=info msg="Deleting orphan bundle ID rke2, release kube-system/rke2-canal"
r
Could you check the following?
Copy code
kubectl -n fleet-local get bundles
The communication between the agent and rancher seems to be fixed
s
👍
Copy code
$ kubectl -n fleet-local --context=harvester003 get bundles
NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-agent-local                             1/1                       
local-managed-system-agent                    1/1                       
mcc-harvester                                 0/1                       NotReady(1) [Cluster fleet-local/local]; daemonset.apps kube-system/harvester-whereabouts [progressing] Updated: 2/3
mcc-harvester-crd                             1/1                       
mcc-local-managed-system-upgrade-controller   1/1                       
mcc-rancher-logging                           0/1                       OutOfSync(1) [Cluster fleet-local/local]
mcc-rancher-logging-crd                       0/1                       OutOfSync(1) [Cluster fleet-local/local]
mcc-rancher-monitoring                        0/1                       OutOfSync(1) [Cluster fleet-local/local]
mcc-rancher-monitoring-crd                    0/1                       WaitApplied(1) [Cluster fleet-local/local]
r
hmmm, maybe wait a bit then check again. It takes time for fleet to sync and reflect the changes.
s
I have to start thinking about work for a short while too...
r
if the statuses are still the same, might need to check fleet-agent’s log again 👀
s
The `fleet-agent-*` pod logs are just ticking along, logging this:
Copy code
time="2023-09-26T08:11:23Z" level=info msg="getting history for release local-managed-system-agent"
time="2023-09-26T08:16:25Z" level=info msg="getting history for release local-managed-system-agent"
time="2023-09-26T08:21:33Z" level=info msg="getting history for release local-managed-system-agent"
and so far the bundles have not changed.
r
Need to check the status of the charts. Do you have the `helm` command at your disposal?
Copy code
helm history local-managed-system-agent -n cattle-system
Maybe also generate a support bundle again since it’s been a while, and we changed the configuration.
s
Copy code
$ helm history local-managed-system-agent -n cattle-system --kube-context harvester003
REVISION	UPDATED                 	STATUS    	CHART                                                                                            	APP VERSION	DESCRIPTION     
4953    	Mon Sep 11 07:25:32 2023	superseded	local-managed-system-agent-v0.0.0+s-e6e150e25f6da0b545e400e61d9ee74f561acf20cb9ba33fcbdc3352724f1	           	Upgrade complete
4954    	Mon Sep 11 07:25:39 2023	superseded	local-managed-system-agent-v0.0.0+s-e6e150e25f6da0b545e400e61d9ee74f561acf20cb9ba33fcbdc3352724f1	           	Upgrade complete
4955    	Mon Sep 11 17:18:56 2023	superseded	local-managed-system-agent-v0.0.0+s-e6e150e25f6da0b545e400e61d9ee74f561acf20cb9ba33fcbdc3352724f1	           	Upgrade complete
4956    	Mon Sep 11 17:19:01 2023	superseded	local-managed-system-agent-v0.0.0+s-e6e150e25f6da0b545e400e61d9ee74f561acf20cb9ba33fcbdc3352724f1	           	Upgrade complete
4957    	Mon Sep 11 17:27:18 2023	superseded	local-managed-system-agent-v0.0.0+s-e6e150e25f6da0b545e400e61d9ee74f561acf20cb9ba33fcbdc3352724f1	           	Upgrade complete
4958    	Mon Sep 11 20:44:10 2023	superseded	local-managed-system-agent-v0.0.0+s-e6e150e25f6da0b545e400e61d9ee74f561acf20cb9ba33fcbdc3352724f1	           	Upgrade complete
4959    	Mon Sep 11 20:44:16 2023	superseded	local-managed-system-agent-v0.0.0+s-d3cb9a953dd679240b86c15757006baeaa3a5072a70879194e5abbb003513	           	Upgrade complete
4960    	Mon Sep 11 20:44:27 2023	superseded	local-managed-system-agent-v0.0.0+s-d3cb9a953dd679240b86c15757006baeaa3a5072a70879194e5abbb003513	           	Upgrade complete
4961    	Mon Sep 11 20:44:31 2023	superseded	local-managed-system-agent-v0.0.0+s-d3cb9a953dd679240b86c15757006baeaa3a5072a70879194e5abbb003513	           	Upgrade complete
4962    	Mon Sep 11 20:44:33 2023	deployed  	local-managed-system-agent-v0.0.0+s-d3cb9a953dd679240b86c15757006baeaa3a5072a70879194e5abbb003513	           	Upgrade complete
Support bundle is cooking...
r
from `cattle-system/rancher-576cf5cc45-4pv96`'s log:
Copy code
2023-09-26T13:05:13.972634730Z 2023/09/26 13:05:13 [ERROR] error syncing 'fleet-local/rancher-logging-crd': handler mcc-bundle: no chart version found for rancher-logging-crd-100.1.3+up3.17.7, requeuing
2023-09-26T13:05:19.226766849Z 2023/09/26 13:05:19 [ERROR] error syncing 'fleet-local/rancher-logging': handler mcc-bundle: no chart version found for rancher-logging-100.1.3+up3.17.7, requeuing
2023-09-26T13:05:21.936231014Z 2023/09/26 13:05:21 [ERROR] error syncing 'fleet-local/rancher-monitoring': handler mcc-bundle: no chart version found for rancher-monitoring-100.1.0+up19.0.3, requeuing
2023-09-26T13:06:01.406798090Z 2023/09/26 13:06:01 [INFO] Downloading repo index from http://harvester-cluster-repo.cattle-system/charts/index.yaml
2023-09-26T13:06:09.134873483Z 2023/09/26 13:06:09 [ERROR] rkecluster fleet-local/local: error while retrieving management cluster from cache: management cluster cache was nil
rancher couldn’t find the charts for those bundles because the harvester cluster repo was already upgraded - it now serves only the new versions of the charts. The cluster is in a middle state due to the previous unsuccessful upgrade. I have an idea: since the `harvester-cluster-repo` pod is just an HTTP server serving chart files, maybe we could temporarily swap the container image tag from `v1.2.0` to `v1.1.2` for rancher and fleet to complete their jobs. Once the status of the bundles is sorted out, we could change the image back and kick-start the upgrade again.
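(A sketch of that image swap, assuming `harvester-cluster-repo` is a Deployment in `cattle-system` whose container shares its name - both names are assumptions here:)

```shell
# Temporarily serve the v1.1.2 charts so rancher/fleet can reconcile the
# old bundles; swap back to v1.2.0 afterwards.
kubectl -n cattle-system set image deployment/harvester-cluster-repo \
  harvester-cluster-repo=rancher/harvester-cluster-repo:v1.1.2
```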
A quicker way is to skip the check imposed by the webhook and start a new upgrade directly. Detailed steps are described below: 1. Create a Version object
Copy code
cat <<EOF | kubectl apply -f -
apiVersion: harvesterhci.io/v1beta1
kind: Version
metadata:
  name: v1.2.0
  namespace: harvester-system
spec:
1. Create a customized Upgrade object
Sorry, please ignore the above. A quicker way is to skip the check imposed by the webhook and start a new upgrade directly. Detailed steps are described below. First, create a Version object:
Copy code
cat <<EOF | kubectl apply -f -
apiVersion: harvesterhci.io/v1beta1
kind: Version
metadata:
  name: v1.2.0
  namespace: harvester-system
spec:
  isoChecksum: '267d65117f6d9601383150b4e513065e673cccba86db9a8c6e7d3cb36a04f6202162f1b95c3c545a7389c4f20f82f5fff6c6e498ff74fcb61e8513992b83e1fb'
  isoURL: https://releases.rancher.com/harvester/v1.2.0/harvester-v1.2.0-amd64.iso
  releaseDate: '20230908'
EOF
Then create a customized Upgrade object
Copy code
cat <<EOF | kubectl apply -f -
apiVersion: harvesterhci.io/v1beta1
kind: Upgrade
metadata:
  annotations:
    harvesterhci.io/skipWebhook: "true"
  name: v1-2-0-skip-check
  namespace: harvester-system
spec:
  version: v1.2.0
EOF
s
Thank you @red-king-19196. I can't do anything during the day today (UK time) so I'll try this approach this evening or tomorrow morning.
Thanks @red-king-19196. I have applied those two objects this morning. The upgrade dialogue is now displaying, but after waiting 20 minutes nothing seems to be happening. No new pods have started since the `fleet-agent-*` and `fleet-controller-*` pods 47 hours ago.
r
Could you show us the Upgrade CR?
Copy code
kubectl -n harvester-system get upgrade v1-2-0-skip-check -o yaml
The upgrade log might have difficulty spinning up due to the faulty bundle. We need to confirm it.
s
Copy code
$ kubectl -n harvester-system --context=harvester003 get upgrade v1-2-0-skip-check -o yaml
Error from server (NotFound): plans.upgrade.cattle.io "v1-2-0-skip-check" not found
Copy code
$ kubectl --all-namespaces --context=harvester003 get upgrade
NAMESPACE       NAME                                                   IMAGE                                 CHANNEL   VERSION
cattle-system   hvst-upgrade-6hp8q-skip-restart-rancher-system-agent   registry.suse.com/bci/bci-base:15.4             23a54be8
cattle-system   sync-additional-ca                                     registry.suse.com/bci/bci-base:15.4             v1.1.0
cattle-system   system-agent-upgrader                                  rancher/system-agent                            v0.3.3-suc
cattle-system   system-agent-upgrader-windows                          rancher/wins                                    v0.4.11
r
My bad, I should’ve used the full name of the Upgrade resource. Please try it again with the following:
Copy code
kubectl -n harvester-system get upgrades.harvesterhci.io v1-2-0-skip-check -o yaml
s
Copy code
$ kubectl -n harvester-system --context=harvester003 get upgrades.harvesterhci.io v1-2-0-skip-check -o yaml
apiVersion: harvesterhci.io/v1beta1
kind: Upgrade
metadata:
  annotations:
    harvesterhci.io/skipWebhook: "true"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"harvesterhci.io/v1beta1","kind":"Upgrade","metadata":{"annotations":{"harvesterhci.io/skipWebhook":"true"},"name":"v1-2-0-skip-check","namespace":"harvester-system"},"spec":{"version":"v1.2.0"}}
  creationTimestamp: "2023-09-28T07:21:09Z"
  finalizers:
  - wrangler.cattle.io/harvester-upgrade-controller
  generation: 2
  labels:
    harvesterhci.io/latestUpgrade: "true"
    harvesterhci.io/upgradeState: PreparingLoggingInfra
  name: v1-2-0-skip-check
  namespace: harvester-system
  resourceVersion: "939012383"
  uid: ef543f21-01c4-4256-9be4-76589b878b4d
spec:
  image: ""
  logEnabled: true
  version: v1.2.0
status:
  conditions:
  - status: Unknown
    type: Completed
  - status: Unknown
    type: LogReady
  previousVersion: v1.2.0
  upgradeLog: v1-2-0-skip-check-upgradelog
r
Looks like the faulty rancher-logging bundle was causing the upgrade log feature not to work, so the entire upgrade process was stuck at the very beginning. We could start the upgrade again with the
logEnabled: false
to prevent this from happening:
Copy code
# remove the stuck upgrade resource
kubectl -n harvester-system delete upgrades v1-2-0-skip-check

# create the version resource if it's missing
cat <<EOF | kubectl apply -f -
apiVersion: harvesterhci.io/v1beta1
kind: Version
metadata:
  name: v1.2.0
  namespace: harvester-system
spec:
  isoChecksum: '267d65117f6d9601383150b4e513065e673cccba86db9a8c6e7d3cb36a04f6202162f1b95c3c545a7389c4f20f82f5fff6c6e498ff74fcb61e8513992b83e1fb'
  isoURL: https://releases.rancher.com/harvester/v1.2.0/harvester-v1.2.0-amd64.iso
  releaseDate: '20230908'
EOF

# create the upgrade resource w/ the skip-webhook annotation and log-disable toggle
cat <<EOF | kubectl apply -f -
apiVersion: harvesterhci.io/v1beta1
kind: Upgrade
metadata:
  annotations:
    harvesterhci.io/skipWebhook: "true"
  name: v1-2-0-skip-check
  namespace: harvester-system
spec:
  logEnabled: false
  version: v1.2.0
EOF
And see if the upgrade proceeds.
👍 1
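One hedged way to watch the recreated Upgrade move past the point it was stuck at before; the state label and condition names below are taken from the CR dump earlier in this thread:

```shell
# Watch the upgrade-state label change away from PreparingLoggingInfra.
kubectl -n harvester-system get upgrades.harvesterhci.io v1-2-0-skip-check \
  --show-labels -w

# Or poll just the Completed condition from the status block:
kubectl -n harvester-system get upgrades.harvesterhci.io v1-2-0-skip-check \
  -o jsonpath='{.status.conditions[?(@.type=="Completed")].status}{"\n"}'
```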
s
Yes - that really got things moving. Thank you. Currently stuck at the first pre-draining, but I'll have a dig through the usual issues page and see if anything matches.
🙌 1
All I needed to do was to reboot each node when it was at the
Pre-drained
state. I don't know why - I could not find any pod logging that it was waiting for anything specific - although there was a Longhorn volume which was trying to attach in order to do a backup (which I started a week ago) but never managed to complete. I think that might have been me trying to back up my
gparted-live
machine, which is actually just a CD-ROM image. That never managed to attach, so it never got backed up. I don't really care - I can always remake it when I next need to check out a volume. Anyway - everything has completed and I'm a very happy person. I hope this has also helped make Harvester better for the future - which is really all that matters 🙂
👍 2
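For anyone hitting the same pre-drain hang, two hedged checks that may be worth trying before resorting to a reboot; the volume CRD name and status fields are assumed from the Longhorn v1.4.x API:

```shell
# A node stuck in pre-drain is normally cordoned; check scheduling status
# (look for SchedulingDisabled in the STATUS column).
kubectl get nodes -o wide

# Look for volumes stuck attaching/detaching, like the backup volume above.
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUST:.status.robustness
```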
r
Glad to hear that! Just in case, could you take a look at the bundle status?
Copy code
kubectl -n fleet-local get bundles
See if there are any errors. We would like to ensure everything is fine since we applied many workarounds. Thank you for being so supportive! I’m sure we found many issues and sorted out the workarounds and solutions during the journey 🙌
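To spot only the unhealthy bundles at a glance, a small sketch using a jq filter. The `ready`/`desiredReady` summary field names are assumed from the fleet Bundle status schema; the sample JSON below is hypothetical, shaped like `kubectl -n fleet-local get bundles -o json` output:

```shell
# Sample Bundle list standing in for live cluster output.
cat > /tmp/bundles.json <<'EOF'
{"items":[
 {"metadata":{"name":"mcc-harvester"},"status":{"summary":{"ready":1,"desiredReady":1}}},
 {"metadata":{"name":"mcc-rancher-logging"},"status":{"summary":{"ready":0,"desiredReady":1}}}
]}
EOF

# Print only bundles whose ready count lags the desired count.
jq -r '.items[]
  | select(.status.summary.ready != .status.summary.desiredReady)
  | .metadata.name' /tmp/bundles.json
```

On a live cluster the same filter would be fed from `kubectl -n fleet-local get bundles -o json` instead of the sample file.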
s
Thanks @red-king-19196.
Copy code
$ kubectl -n fleet-local --context=harvester003 get bundles
NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-agent-local                             1/1                       
local-managed-system-agent                    1/1                       
mcc-harvester                                 1/1                       
mcc-harvester-crd                             1/1                       
mcc-local-managed-system-upgrade-controller   1/1                       
mcc-rancher-logging                           0/1                       OutOfSync(1) [Cluster fleet-local/local]
mcc-rancher-logging-crd                       1/1                       
mcc-rancher-monitoring                        0/1                       OutOfSync(1) [Cluster fleet-local/local]
mcc-rancher-monitoring-crd                    1/1
👀 1
r
Could you help generate a new support bundle to let us know the current status of the cluster? Thanks!
s
r
Due to the previous apply-manifest failure, the add-on conversions of
rancher-logging
and
rancher-monitoring
were incomplete. Later rounds of upgrades just skipped the conversions because the Harvester version was already v1.2.0. However, the functionality of both charts seems fine; they still run in the previous versions. cc @ancient-pizza-13099 Do you know if doing a manual conversion for the two charts is possible?
a
converting them manually is possible
(1) copy the existing yaml output of managedchart
rancher-monitoring
and
rancher-logging
(2) delete those 2 managedcharts, wait until all pods are removed; (3) create the addons, making sure to replace some fields of the
rancher-monitoring
addon, e.g. VIP
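A rough sketch of those three steps as commands; the managedchart namespace is taken from earlier in the thread, but treat this as a starting point under those assumptions, not a verified procedure:

```shell
# (1) keep a copy of the existing managedcharts
kubectl -n fleet-local get managedchart rancher-monitoring -o yaml > rancher-monitoring-mc.yaml
kubectl -n fleet-local get managedchart rancher-logging -o yaml > rancher-logging-mc.yaml

# (2) delete them and watch until the workload pods are gone
kubectl -n fleet-local delete managedchart rancher-monitoring rancher-logging
kubectl -n cattle-monitoring-system get pods -w

# (3) recreate as addons, editing cluster-specific fields (e.g. the VIP)
# in the addon manifests before applying; then confirm they show up.
kubectl get addons.harvesterhci.io -A
```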
s
Would the add-on being out of sync be causing extremely high load on the physical servers? My previously stable Harvester cluster has become very unstable - stability appears to have regressed to the level of about 18 months ago. Nodes keep going offline, causing the VMs to pause and the volumes to become degraded in Longhorn; the repair of the volumes then causes extremely high load... and the cycle goes on.
a
Seems like it's in a bad loop. Did you try to stop all VMs, then open the Longhorn UI to check and rebuild volumes/replicas?
s
I stopped all VMs (taking my home offline) and waited for the load averages to reduce to around 1. I started up only the VMs I really need ("prod" k3s cluster, home-automation, Nagios, & VPN termination) and mostly things are stable. I did have one period when VMs went into a 'paused' state, but they recovered.
The load-average on the Harvester nodes is much higher than with v1.1.2 and I cannot run the same number of VMs that I could with v1.1.2.
a
Get the top N processes and the top N pods.
And you may try to stop addons, like monitoring and logging.
s
I have already stopped all addons from the GUI; there is nothing defined in Logging and Monitoring. I do note that
prometheus
is at the top of
top
most of the time on one node. I haven't done the manual conversion yet because I haven't had time to get my head around exactly what I need to do.
An example of the top of
top
is:
Copy code
top - 12:56:03 up 2 days, 23:53,  1 user,  load average: 4.87, 3.61, 3.22
Tasks: 447 total,   1 running, 446 sleeping,   0 stopped,   0 zombie
%Cpu(s): 44.2 us,  5.7 sy,  0.0 ni, 48.9 id,  0.2 wa,  0.0 hi,  1.0 si,  0.0 st
MiB Mem : 63703.08+total, 26951.77+free, 20200.48+used, 17208.80+buff/cache
MiB Swap:    0.000 total,    0.000 free,    0.000 used. 43502.59+avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                                      
 8824 10001     20   0 1586468 664860  37160 S 107.9 1.019   1655:05 adapter                                                                                                                                                                                      
 6671 1000      20   0 6993456 2.180g 189692 S 101.7 3.505   1269:25 prometheus                                                                                                                                                                                   
 3803 root      20   0 2417388 846736  77560 S 100.3 1.298 726:05.37 rancher                                                                                                                                                                                      
 5066 root      20   0  877820 141512  55484 S 22.85 0.217 281:45.03 containerd                                                                                                                                                                                   
15088 107       20   0 13.054g 4.259g  22036 S 11.59 6.846 194:04.76 qemu-system-x86                                                                                                                                                                              
10765 root      20   0 4950184 2.477g  78820 S 8.278 3.982 914:47.21 kube-apiserver                                                                                                                                                                               
 2372 root      20   0 11.283g 440184 283816 S 7.616 0.675 474:24.04 etcd                                                                                                                                                                                         
 5430 root      20   0  906188 169212  65172 S 6.623 0.259 275:38.97 kubelet                                                                                                                                                                                      
 7661 root      20   0 1263936 417124  49976 S 5.298 0.639 264:06.55 harvester                                                                                                                                                                                    
 7816 root      20   0  968436 203888  41424 S 2.649 0.313  96:55.79 longhorn-manage                                                                                                                                                                              
 3544 root      20   0 1040556 280356  65708 S 1.656 0.430  93:23.73 kube-controller                                                                                                                                                                              
 9524 root      20   0 2271644  37524  13880 S 1.325 0.058  98:39.46 longhorn-instan                                                                                                                                                                              
    1 root      20   0  205468  17684   9632 S 0.993 0.027  46:17.48 systemd                                                                                                                                                                                      
10135 root      20   0 1903568  42744  13508 S 0.993 0.066  48:40.50 longhorn                                                                                                                                                                                     
14177 root      20   0 1829900  37756  13572 S 0.993 0.058   8:55.57 longhorn                                                                                                                                                                                     
16269 root      20   0 2181364  84836  46736 S 0.993 0.130  65:37.15 calico-node                                                                                                                                                                                  
 6897 root      20   0  723560  18196   9420 S 0.662 0.028   1:49.64 containerd-shim                                                                                                                                                                              
 7161 1001      20   0 1794104 209180  37784 S 0.662 0.321  23:58.47 virt-controller                                                                                                                                                                              
 9559 root      20   0  755312  60740  32916 S 0.662 0.093   3:33.47 harvester-netwo                                                                                                                                                                              
   21 root      20   0       0      0      0 S 0.331 0.000   1:03.77 ksoftirqd/1                                                                                                                                                                                  
   27 root      20   0       0      0      0 S 0.331 0.000  62:42.22 ksoftirqd/2                                                                                                                                                                                  
 2807 root      20   0  723560  16936   9328 S 0.331 0.026   1:39.28 containerd-shim                                                                                                                                                                              
 2865 root      20   0  723700  18396   9612 S 0.331 0.028   1:38.47 containerd-shim                                                                                                                                                                              
 3306 root      20   0  765904  76648  37272 S 0.331 0.118  12:31.75 kube-scheduler
An example of top pods is:
Copy code
$ kubectl top pods --sort-by='cpu' --context=harvester003 --all-namespaces
NAMESPACE                   NAME                                                     CPU(cores)   MEMORY(bytes)   
harvester-system            harvester-77c7bdd669-c8cxb                               727m         1411Mi          
kube-system                 kube-apiserver-harvester002                              515m         3914Mi          
kube-system                 kube-apiserver-harvester003                              332m         3619Mi          
cattle-monitoring-system    prometheus-rancher-monitoring-prometheus-0               235m         2063Mi          
default                     virt-launcher-kube002-zpg6s                              206m         7777Mi          
kube-system                 kube-apiserver-harvester001                              204m         1757Mi          
cattle-fleet-local-system   fleet-agent-75f5945649-8f6fp                             199m         450Mi           
default                     virt-launcher-nagioskube002-th6kz                        183m         2631Mi          
default                     virt-launcher-kube003-286s2                              139m         5331Mi          
kube-system                 etcd-harvester002                                        125m         567Mi           
default                     virt-launcher-kube004-kw769                              114m         4530Mi          
default                     virt-launcher-home-assistant-jwmw5                       112m         2579Mi          
longhorn-system             instance-manager-e-1041bf96596625fc7adf7838a77ad238      91m          234Mi           
kube-system                 etcd-harvester003                                        86m          582Mi           
kube-system                 etcd-harvester001                                        83m          522Mi           
harvester-system            harvester-77c7bdd669-jwfvl                               63m          522Mi           
harvester-system            harvester-77c7bdd669-fb2wd                               60m          592Mi           
kube-system                 rke2-canal-gl52z                                         52m          243Mi           
kube-system                 rke2-canal-v2vpf                                         52m          190Mi           
cattle-monitoring-system    rancher-monitoring-operator-559767d69b-lxkkp             39m          228Mi           
longhorn-system             longhorn-manager-8pkmn                                   39m          261Mi           
kube-system                 rke2-canal-gkkw4                                         38m          191Mi           
cattle-monitoring-system    rancher-monitoring-prometheus-adapter-8846d4757-bp2gj    37m          737Mi           
longhorn-system             instance-manager-e-1e123034ec30fd9c07f37ce7446d272b      37m          117Mi           
longhorn-system             instance-manager-e-65edf6e430281d7f0bb5498a3eac3469      33m          82Mi            
longhorn-system             longhorn-manager-cbshl                                   30m          236Mi           
default                     virt-launcher-wstunnel-wireguard-95pxq                   29m          1172Mi          
cattle-system               rancher-576cf5cc45-j6kfn                                 26m          1649Mi          
kube-system                 kube-controller-manager-harvester003                     26m          264Mi           
longhorn-system             instance-manager-r-65edf6e430281d7f0bb5498a3eac3469      26m          774Mi           
longhorn-system             instance-manager-r-1041bf96596625fc7adf7838a77ad238      23m          814Mi           
longhorn-system             longhorn-manager-7ft4z                                   22m          216Mi           
longhorn-system             instance-manager-r-1e123034ec30fd9c07f37ce7446d272b      21m          838Mi           
cattle-system               rancher-576cf5cc45-5vvmr                                 19m          1012Mi          
longhorn-system             engine-image-ei-1d169b76-z8tds                           18m          19Mi            
cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-7c7xf        17m          28Mi            
longhorn-system             engine-image-ei-1d169b76-mlkrd                           15m          24Mi            
longhorn-system             engine-image-ei-1d169b76-dczbc                           14m          20Mi            
cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-4sghm        11m          27Mi            
kube-system                 rke2-metrics-server-74f878b999-gknt2                     10m          40Mi            
harvester-system            harvester-webhook-7df6c7df75-44t7r                       9m           250Mi           
harvester-system            harvester-webhook-7df6c7df75-n92ht                       8m           182Mi           
longhorn-system             longhorn-recovery-backend-fb89c6ddd-6tvz6                8m           195Mi           
harvester-system            virt-api-6dc9cc7654-sxclk                                7m           257Mi           
harvester-system            virt-controller-7468cc6d9-vq4gc                          7m           133Mi           
longhorn-system             longhorn-admission-webhook-57cf4f4689-9gkpg              7m           300Mi           
harvester-system            virt-api-6dc9cc7654-wfdsf                                7m           198Mi           
kube-system                 kube-scheduler-harvester003                              6m           50Mi            
kube-system                 kube-scheduler-harvester002                              6m           53Mi            
kube-system                 rke2-ingress-nginx-controller-x579x                      5m           201Mi           
cattle-fleet-system         fleet-controller-56786984f4-tctds                        5m           157Mi           
kube-system                 kube-scheduler-harvester001                              5m           72Mi            
cattle-logging-system       rancher-logging-root-fluentbit-xkgq6                     5m           43Mi            
harvester-system            virt-operator-77c86586f6-m8sss                           5m           210Mi           
harvester-system            harvester-network-controller-manager-68fd49b88f-gkpz4    5m           49Mi            
longhorn-system             longhorn-admission-webhook-57cf4f4689-kp7fm              5m           256Mi           
harvester-system            harvester-load-balancer-6d89b964bb-ts8sp                 5m           55Mi            
kube-system                 rke2-ingress-nginx-controller-2dv6m                      5m           260Mi           
cattle-logging-system       rancher-logging-root-fluentbit-jmmlp                     5m           42Mi            
harvester-system            virt-controller-7468cc6d9-qw6nh                          5m           211Mi           
kube-system                 rke2-ingress-nginx-controller-2rvsr                      4m           233Mi           
kube-system                 kube-controller-manager-harvester001                     4m           32Mi            
kube-system                 kube-controller-manager-harvester002                     4m           32Mi            
harvester-system            kube-vip-mhld7                                           4m           24Mi            
cattle-logging-system       rancher-logging-574448c578-sx2l2                         4m           126Mi           
longhorn-system             longhorn-recovery-backend-fb89c6ddd-mdprq                4m           246Mi           
kube-system                 cloud-controller-manager-harvester002                    4m           32Mi            
harvester-system            harvester-webhook-7df6c7df75-kj66q                       3m           153Mi           
harvester-system            harvester-network-webhook-697c754ffb-dn8x6               3m           154Mi           
kube-system                 cloud-controller-manager-harvester003                    3m           22Mi            
harvester-system            virt-handler-x4m7x                                       3m           252Mi           
cattle-monitoring-system    rancher-monitoring-kube-state-metrics-5bc8bb48bd-df45p   3m           44Mi            
cattle-fleet-system         gitjob-845b9dcc47-jzkvt                                  3m           99Mi            
cattle-logging-system       rancher-logging-root-fluentd-0                           3m           319Mi           
longhorn-system             longhorn-loop-device-cleaner-7twf6                       3m           3Mi             
harvester-system            harvester-load-balancer-webhook-6dd77c56bf-k4fgn         3m           153Mi           
harvester-system            harvester-network-controller-8kgnm                       2m           67Mi            
longhorn-system             csi-provisioner-9674b9b-rc4tk                            2m           17Mi            
kube-system                 rke2-coredns-rke2-coredns-7f75564ff4-b4gmb               2m           37Mi            
longhorn-system             csi-provisioner-9674b9b-c8q8p                            2m           22Mi            
kube-system                 cloud-controller-manager-harvester001                    2m           25Mi            
harvester-system            harvester-network-controller-2fvvc                       2m           42Mi            
longhorn-system             longhorn-conversion-webhook-678ddcc967-kwrxg             2m           202Mi           
harvester-system            kube-vip-jc5cj                                           2m           19Mi            
longhorn-system             longhorn-conversion-webhook-678ddcc967-prg2x             2m           146Mi           
longhorn-system             backing-image-manager-36c7-45eb                          2m           24Mi            
kube-system                 rke2-coredns-rke2-coredns-7f75564ff4-b7hlj               2m           33Mi            
cattle-logging-system       rancher-logging-root-fluentbit-xjgwr                     2m           51Mi            
longhorn-system             csi-resizer-76f769988f-kdmlb                             2m           20Mi            
harvester-system            virt-handler-4hnzl                                       2m           225Mi           
cattle-system               system-upgrade-controller-5685d568ff-f76b8               2m           77Mi            
cattle-system               rancher-webhook-67bd6cf65d-6zd6s                         2m           171Mi           
harvester-system            virt-handler-4tzdl                                       2m           238Mi           
harvester-system            harvester-node-disk-manager-xnzwp                        1m           31Mi            
harvester-system            kube-vip-q5jbn                                           1m           19Mi            
kube-system                 rke2-coredns-rke2-coredns-autoscaler-84d67b7c48-g79nd    1m           18Mi            
kube-system                 kube-proxy-harvester002                                  1m           31Mi            
kube-system                 kube-proxy-harvester001                                  1m           28Mi            
kube-system                 harvester-whereabouts-bsc27                              1m           28Mi            
cattle-logging-system       harvester-default-event-tailer-0                         1m           14Mi
(limited by the amount I can post in a message)
a
Without comparing to v1.1.2, it is difficult to say where the extra resources are being used 😂
s
> (1) copy the existing yaml output of managedchart
rancher-monitoring
and
rancher-logging
> (2) delete those 2 managedcharts, wait until all PODs are removed
> (3) create addon, note to replace some fields of
rancher-monitoring
addon, e.g. VIP Could you help me a bit more with this, please? I've copied the existing yaml for the ManagedCharts - here's one example.
Copy code
kubectl get managedchart rancher-monitoring -n fleet-local --context=harvester003 -o yaml > tmp/managedchart_rancher-monitoring.yaml
I've removed the managed chart - but I don't know which Pods would be removed. For rancher-monitoring I guessed that
prometheus-rancher-monitoring-prometheus-0
would go, but it hasn't. So my guess is wrong 😞
a
kubectl get managedchart -A
kubectl get addons.harvesterhci.io -A
, check if
rancher-monitoring
addon is enabled
if
rancher-monitoring
is not there, then
kubectl get deployment -n cattle-monitoring-system
, and get replicaset, then kubectl delete them
just wait
kubectl get addons.harvesterhci.io -A
, check if
rancher-monitoring
addon is enabled
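Putting those last few messages together as one hedged sequence; the resource names come from earlier in the thread, and the workload deletion only applies if neither a managedchart nor an addon still owns them:

```shell
# Confirm the managedchart is gone and no addon has taken over.
kubectl get managedchart -A
kubectl get addons.harvesterhci.io -A

# If rancher-monitoring is absent from both, inspect the leftover workloads
# before deleting anything.
kubectl -n cattle-monitoring-system get deployments,statefulsets,replicasets

# Then delete the leftovers by name (safer than --all) and wait for the
# pods, including prometheus-rancher-monitoring-prometheus-0, to terminate.
kubectl -n cattle-monitoring-system get pods -w
```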