
sticky-summer-13450

07/11/2022, 1:29 PM
All of the metrics in Harvester's dashboard say there is no data. I don't know when it started happening, but I went to look just now and it's all blank: no data anywhere.
I thought I'd give the Prometheus pod a kick, so I deleted it, but it has not restarted, claiming:
Unable to attach or mount volumes: unmounted volumes=[prometheus-rancher-monitoring-prometheus-db], unattached volumes=[prometheus-rancher-monitoring-prometheus-db nginx-home config-out tls-assets web-config prometheus-rancher-monitoring-prometheus-rulefiles-0 config kube-api-access-hwjgz prometheus-nginx]: timed out waiting for the condition
Now I'm stuck
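(For anyone hitting the same wall: the pod's events usually say why a volume won't attach. A minimal sketch, assuming a default Harvester install where the monitoring stack lives in cattle-monitoring-system and the Prometheus pod is the StatefulSet's -0 replica:)
# Show the pod's events, including the volume attach/mount failures
kubectl -n cattle-monitoring-system describe pod prometheus-rancher-monitoring-prometheus-0
# Check that the backing PVC is still bound to its Longhorn volume
kubectl -n cattle-monitoring-system get pvc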
Humm - looking at the Longhorn dashboard there are two degraded volumes, one of which I think may be the Prometheus volume above.
I don't know what to do to stop it being degraded!
I guess Longhorn is supposed to rebuild the replicas itself, but it's stuck at 85% complete, and I don't know what to do.
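(A sketch of checking rebuild state from the CLI, assuming the standard longhorn-system namespace and Longhorn's documented CRDs; <volume-name> is a placeholder:)
# List Longhorn volumes and their robustness (healthy/degraded/faulted)
kubectl -n longhorn-system get volumes.longhorn.io
# Inspect a volume's replicas to see which one is still rebuilding
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=<volume-name>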

great-bear-19718

07/12/2022, 9:58 AM
Are you able to generate and share a support bundle, please?
I can try to take a look at the logs and see what is going on.

sticky-summer-13450

07/12/2022, 5:22 PM
Oh - sure, thanks. I'll do that when I get home after work.

great-bear-19718

07/20/2022, 7:11 AM
The issue is that Prometheus can't attach the PVC, so there was no working monitoring at the time the support bundle was generated.
What is the CPU usage on the harvester001 node?
There should be a pod...
instance-manager-e-d920c0a2
...on harvester001. Are you able to delete it? Longhorn should schedule another one, and that should kick the volume back into action.
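(A sketch of how to find and bounce that pod, assuming Longhorn's documented longhorn.io/component pod label:)
# List instance-manager pods and the nodes they run on
kubectl -n longhorn-system get pods -l longhorn.io/component=instance-manager -o wide
# Delete the engine instance-manager on the affected node; Longhorn recreates it
kubectl -n longhorn-system delete pod instance-manager-e-d920c0a2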

sticky-summer-13450

07/20/2022, 2:35 PM
The load in the Harvester GUI does not look bad (image below) and the CPU use is okay:
%Cpu(s): 19.4 us, 16.7 sy,  0.0 ni,  0.0 id, 63.6 wa,  0.0 hi,  0.4 si,  0.0 st
But now you come to mention it, the load average is over 3000; I hadn't spotted that... There are over 3000 instances of blkid doing not-a-lot.
root     32750  9360  0 Jul17 ?        00:00:00 blkid -p -s TYPE -s PTTYPE -o export /dev/longhorn/pvc-a5b5fe4c-eca4-4c97-a3db-f9490980c044
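(As an aside, a quick way to count those stuck processes; the [b]lkid bracket trick just stops grep from matching its own process:)
# Count running blkid processes and show how long they have been alive
ps -eo pid,etime,cmd | grep '[b]lkid' | wc -l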
so I did
~> sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml delete pod instance-manager-e-d920c0a2 --namespace=longhorn-system
pod "instance-manager-e-d920c0a2" deleted
and all those processes have gone and the load average is coming down. I'm currently doing this remotely (SSHed via a bastion with some port forwarding), and I can't remember how to view the Longhorn UI remotely, so I'll check that out when I get back home.
And the stats are back in Harvester. Thanks.
And I've remembered how to view the Longhorn UI from the Harvester UI.
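(For the record, when only CLI access is available, port-forwarding the Longhorn frontend service is another option; a sketch, assuming the default longhorn-frontend service name and port:)
# Forward the Longhorn UI to localhost:8080 over the existing SSH session
kubectl -n longhorn-system port-forward svc/longhorn-frontend 8080:80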
There's another degraded volume in Longhorn, associated with an MSSQL database (virt-launcher-mssql-bznzq). Do you have advice for stopping it being degraded too?

great-bear-19718

07/20/2022, 11:54 PM
Conditions:
    Restore:
      Last Transition Time:  2022-02-13T21:57:33Z
      Status:                False
      Type:                  restore
    Scheduled:
      Last Transition Time:  2022-07-03T15:58:15Z
      Status:                True
      Type:                  scheduled
    Toomanysnapshots:
      Last Transition Time:  2022-07-03T05:26:56Z
      Message:               Snapshots count is 248 over the warning threshold 100
      Reason:                TooManySnapshots
      Status:                True
      Type:                  toomanysnapshots
I suspect that is likely due to too many snapshots.
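(The condition block above comes from the volume resource itself; a sketch of pulling it for any volume, with <volume-name> as a placeholder:)
# Inspect a Longhorn volume's conditions, including toomanysnapshots
kubectl -n longhorn-system describe volumes.longhorn.io <volume-name>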

sticky-summer-13450

07/21/2022, 7:56 AM
I have never made a snapshot. As far as I am concerned the Longhorn storage is, and should be, a black box in Harvester. Yes, according to the Longhorn UI, between 2022-06-27T04:10:11Z and 2022-06-27T13:55:17Z hundreds of "system hidden" snapshots were created. I was not doing anything to the system at that time of night! :-) What should I do to remediate that?