
sticky-summer-13450

07/11/2022, 1:29 PM
All of the metrics in Harvester's dashboard say there is no data. I don't know when it started happening, but I went to look just now and it's all blank: no data anywhere.
I thought I'd give the Prometheus pod a kick, so I deleted it, but it has not restarted, claiming:
Unable to attach or mount volumes: unmounted volumes=[prometheus-rancher-monitoring-prometheus-db], unattached volumes=[prometheus-rancher-monitoring-prometheus-db nginx-home config-out tls-assets web-config prometheus-rancher-monitoring-prometheus-rulefiles-0 config kube-api-access-hwjgz prometheus-nginx]: timed out waiting for the condition
Now I'm stuck
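(For anyone hitting the same wall: the pod's events usually say why a volume won't attach. A minimal sketch, assuming a default Harvester install where the monitoring stack lives in cattle-monitoring-system and the Prometheus pod is the StatefulSet's -0 replica:)
# Show the pod's events, including the volume attach/mount failures
kubectl -n cattle-monitoring-system describe pod prometheus-rancher-monitoring-prometheus-0
# Check that the backing PVC is still bound to its Longhorn volume
kubectl -n cattle-monitoring-system get pvc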
Humm - looking at the Longhorn dashboard there are two degraded volumes, one of which I think may be the Prometheus volume above.
I don't know what to do to stop it being degraded!
I guess Longhorn is supposed to rebuild the replicas itself, but it's stuck at 85% complete, and I don't know what to do.
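(A sketch of checking rebuild state from the CLI, assuming the standard longhorn-system namespace and Longhorn's documented CRDs; <volume-name> is a placeholder:)
# List Longhorn volumes and their robustness (healthy/degraded/faulted)
kubectl -n longhorn-system get volumes.longhorn.io
# Inspect a volume's replicas to see which one is still rebuilding
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=<volume-name>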

great-bear-19718

07/12/2022, 9:58 AM
Are you able to generate and share a support bundle, please?
I can try to take a look at the logs and see what is going on.

sticky-summer-13450

07/12/2022, 5:22 PM
Oh - sure, thanks. I'll do that when I get home after work.

great-bear-19718

07/20/2022, 7:11 AM
The issue is that Prometheus can't attach the PVC, so there was no working monitoring at the time the support bundle was generated.
What is the CPU usage on the harvester001 node?
There should be a pod...
instance-manager-e-d920c0a2
...on harvester001. Are you able to delete it? Longhorn should schedule another one, and that should kick the volume back into action.
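(A sketch of how to find and bounce that pod, assuming Longhorn's documented longhorn.io/component pod label:)
# List instance-manager pods and the nodes they run on
kubectl -n longhorn-system get pods -l longhorn.io/component=instance-manager -o wide
# Delete the engine instance-manager on the affected node; Longhorn recreates it
kubectl -n longhorn-system delete pod instance-manager-e-d920c0a2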

sticky-summer-13450

07/20/2022, 2:35 PM
The load in the Harvester GUI does not look bad (image below) and the CPU use is okay:
%Cpu(s): 19.4 us, 16.7 sy,  0.0 ni,  0.0 id, 63.6 wa,  0.0 hi,  0.4 si,  0.0 st
But now you come to mention it, the load average is over 3000; I hadn't spotted that... There are over 3000 instances of blkid doing not-a-lot.
root     32750  9360  0 Jul17 ?        00:00:00 blkid -p -s TYPE -s PTTYPE -o export /dev/longhorn/pvc-a5b5fe4c-eca4-4c97-a3db-f9490980c044
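(As an aside, a quick way to count those stuck processes; the [b]lkid bracket trick just stops grep from matching its own process:)
# Count running blkid processes and show how long they have been alive
ps -eo pid,etime,cmd | grep '[b]lkid' | wc -l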
so I did
~> sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml delete pod instance-manager-e-d920c0a2 --namespace=longhorn-system
pod "instance-manager-e-d920c0a2" deleted
and all those processes have gone and the load average is coming down. I'm currently doing this remotely (SSHed via a bastion with some port forwarding), and I can't remember how to view the Longhorn UI remotely, so I'll check that out when I get back home.
And the stats are back in Harvester. Thanks.
And I've remembered how to view the Longhorn UI from the Harvester UI.
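(For the record, when only CLI access is available, port-forwarding the Longhorn frontend service is another option; a sketch, assuming the default longhorn-frontend service name and port:)
# Forward the Longhorn UI to localhost:8080 over the existing SSH session
kubectl -n longhorn-system port-forward svc/longhorn-frontend 8080:80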
There's another degraded volume in Longhorn, associated with an MSSQL database (virt-launcher-mssql-bznzq). Do you have advice for stopping it being degraded too?

great-bear-19718

07/20/2022, 11:54 PM
Conditions:
    Restore:
      Last Transition Time:  2022-02-13T21:57:33Z
      Status:                False
      Type:                  restore
    Scheduled:
      Last Transition Time:  2022-07-03T15:58:15Z
      Status:                True
      Type:                  scheduled
    Toomanysnapshots:
      Last Transition Time:  2022-07-03T05:26:56Z
      Message:               Snapshots count is 248 over the warning threshold 100
      Reason:                TooManySnapshots
      Status:                True
      Type:                  toomanysnapshots
I suspect that is likely due to too many snapshots.
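(The condition block above comes from the volume resource itself; a sketch of pulling it for any volume, with <volume-name> as a placeholder:)
# Inspect a Longhorn volume's conditions, including toomanysnapshots
kubectl -n longhorn-system describe volumes.longhorn.io <volume-name>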

sticky-summer-13450

07/21/2022, 7:56 AM
I have never made a snapshot. As far as I am concerned the Longhorn storage is, and should be, a black box in Harvester. Yes, according to the Longhorn UI, between 2022-06-27T04:10:11Z and 2022-06-27T13:55:17Z hundreds of "system hidden" snapshots were created. I was not doing anything to the system at that time of night! :-) What should I do to remediate that?