This message was deleted Rancher Users #longhorn-storage


# longhorn-storage

adamant-kite-43734

03/06/2024, 8:17 PM

This message was deleted.

bland-article-62755

03/06/2024, 8:25 PM

As I read this, I see it as: longhorn_backup_state is greater than or equal to 4

bland-article-62755

03/06/2024, 8:25 PM

What's the goal here?

bland-article-62755

03/06/2024, 8:25 PM

You want a report of backups that have failed?

bland-article-62755

03/06/2024, 8:26 PM

Or backups that are taking a long time?

bland-article-62755

03/06/2024, 8:31 PM

maybe you need the offset modifier?

bland-article-62755

03/06/2024, 8:31 PM

(I'm absolutely guessing here)

bland-article-62755

03/06/2024, 8:32 PM

For example, the following expression returns the value of http_requests_total 5 minutes in the past relative to the current query evaluation time:

http_requests_total offset 5m

bland-article-62755

03/06/2024, 8:35 PM

I'm slightly worried that the longhorn_backup_state is either a string or that it's going to evaluate how many results there are instead of looking for backups that were created in the past 10 minutes that have failed.

bland-article-62755

03/06/2024, 8:38 PM

https://github.com/longhorn/longhorn/blob/d2b7616691e2d2f89be05004b41cfc48ea342df8/chart/templates/crds.yaml#L672

bland-article-62755

03/06/2024, 8:40 PM

also from the crds:

state:
  description: The backup creation state. Can be "", "InProgress", "Completed", "Error", "Unknown".
  type: string

bland-article-62755

03/06/2024, 8:42 PM

I dunno. But hopefully one of those might be in the right direction. ¯\_(ツ)_/¯

curved-piano-98970

03/07/2024, 7:44 AM

Hello there Brian

curved-piano-98970

03/07/2024, 7:44 AM

Sorry, I guess there's a bit of a timezone difference

curved-piano-98970

03/07/2024, 7:44 AM

or me completely disregarding slack, sorry

curved-piano-98970

03/07/2024, 7:44 AM

The thing is yes, the backups can be in one of those 5 states

curved-piano-98970

03/07/2024, 7:46 AM

States 0, 1, 2, and 3 are OK for me, but for states 4 and 5 I want Prometheus to fire an alarm. The thing is, if I don't run the query through any time manipulation, I also have to account for backups older than 24h: I can have a failed backup that is 48h old, and I don't want Prometheus to fire for that. @bland-article-62755

👍 1

curved-piano-98970

03/07/2024, 7:51 AM

So you'd suggest using longhorn_backup_state offset 5m?

bland-article-62755

03/07/2024, 3:26 PM

That's what I'd try, but I wouldn't have guessed that you could substitute ints for the list. If you could, then Error would be [...] and [...] would match the Unknown string, but I don't think it works that way. (I reserve the right to be wrong as I actually have no idea - it just doesn't make sense to me)

bland-article-62755

03/07/2024, 3:51 PM

I think this is relevant for what you're trying to do. I think what you want might be...

count ( longhorn_backup_state offset 10m =~ "Error|Unknown" ) > 0

bland-article-62755

03/07/2024, 3:53 PM

As I read it, it would count the number of entries that the search returns (backup states from the last 10 minutes that have a state of "Error" or "Unknown"), and if there are more results than 0, fire an alert.
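
One caveat on the expression above: in PromQL the `=~` operator is only valid inside a label matcher (`{label=~"..."}`), and sample values are always floats, so a regex cannot be applied to the value of longhorn_backup_state. A minimal sketch of the same idea using a numeric comparison, assuming the metric encodes Error and Unknown as 4 and 5 as described earlier in the thread:

```promql
# Count the backup series whose current value is 4 or 5
# (assumed here to mean Error and Unknown) and return a
# result only when at least one such series exists.
count(longhorn_backup_state >= 4) > 0
```

Since count() over an empty vector returns no result at all, the trailing `> 0` is not strictly needed; it only makes the intent explicit.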

bland-article-62755

03/07/2024, 3:58 PM

I think that longhorn_backup_state[10m] is shorthand for the offset, so that might work too.

bland-article-62755

03/07/2024, 3:59 PM

Could be that count(longhorn_backup_state[10m]=~"Error|Unknown")>0 works too.
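
For reference, `offset` and a range selector such as `[10m]` are not interchangeable in PromQL: `offset` shifts the evaluation time of an instant vector, while `[10m]` selects every sample from the last 10 minutes as a range vector, which has to be passed through a function before it can be compared. A short sketch of the two forms as separate queries (the 10m window is just the value used in this thread):

```promql
# Query 1: instant vector, the value each series had 10 minutes ago.
longhorn_backup_state offset 10m

# Query 2: range vector reduced to one value per series,
# here the highest state seen in the last 10 minutes.
max_over_time(longhorn_backup_state[10m])
```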

curved-piano-98970

03/08/2024, 11:09 AM

Thank you brian

curved-piano-98970

03/08/2024, 11:10 AM

But I think your PromQL query won't work, because the state of the backup changes

curved-piano-98970

03/08/2024, 11:10 AM

I tried it

curved-piano-98970

03/08/2024, 11:11 AM

I don't have a log of backup xyz changing from state 0 to state 1 to state 2 to 3; it is directly 1 or 2 or 3...

curved-piano-98970

03/08/2024, 11:11 AM

But i ended up going with something like max(max_over_time(longhorn_backup_state[24h])) by (backup, volume) == 3
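
A possible variant for the "don't fire on a failure that is already 48h old" requirement, as a sketch only. It assumes that states 4 and 5 are the failure states (as stated earlier in the thread) and that the exporter keeps reporting the same series, with the same labels, for as long as the backup exists:

```promql
# Left side: series that are in state 4 or 5 right now.
# Right side: series that were already in state 4 or 5 24 hours ago.
# "unless" drops every left-side series that has a match with
# identical labels on the right, so only failures newer than
# 24h remain.
longhorn_backup_state >= 4
unless
(longhorn_backup_state offset 24h >= 4)
```

A backup that did not exist 24 hours ago has no series on the right-hand side, so it is kept if it is currently in state 4 or 5.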

curved-piano-98970

03/08/2024, 11:12 AM

I have one more question. It's more Prometheus related, but maybe you can help...

curved-piano-98970

03/08/2024, 11:13 AM

I added to the default service monitor the following snippet:

- metricRelabelings:
    - sourceLabels: [__tmp]
      regex: '(.*)'
      replacement: 'cloudfire-stage-cortex'
      targetLabel: k8s_cluster
      action: replace

How come the Prometheus metrics get duplicated? If I leave it alone, I find just one service discovery target inside Prometheus; if I add this, I now have two entries for the same backup. What am I doing wrong?

bland-article-62755

03/08/2024, 3:08 PM

by (backup, volume) meaning it's checking both a backup object and a volume object, so the volume foo says backup bar is messed up. Also the backup object bar says "I'm messed up", so there are two entries?

bland-article-62755

03/08/2024, 3:08 PM

idk, just a best guess.
