# longhorn-storage
b
As I read this, I see it as: longhorn_backup_state is greater than or equal to 4
What's the goal here?
You want a report of backups that have failed?
Or backups that are taking a long time?
maybe you need the offset modifier?
(I'm absolutely guessing here)
For example, the following expression returns the value of http_requests_total 5 minutes in the past relative to the current query evaluation time:
http_requests_total offset 5m
I'm slightly worried that the longhorn_backup_state is either a string or that it's going to evaluate how many results there are instead of looking for backups that were created in the past 10 minutes that have failed.
also from the crds:
state:
  description: The backup creation state. Can be "", "InProgress", "Completed", "Error", "Unknown".
  type: string
I dunno. But hopefully one of those might be in the right direction. ¯\_(ツ)_/¯
c
Hello there Brian
Sorry but i guess a bit of timezone difference
or me completely disregarding slack, sorry
The thing is yes, the backups can be in one of those 5 states
States 0, 1, 2 and 3 are ok states for me, but for states 4 and 5 I want prometheus to fire an alarm. The thing is, I might have to account for a backup that is older than 24h if I don't run the query through any time manipulation, because I can have a failed backup that is 48h old and I don't want prometheus to fire for that @bland-article-62755
so you'd suggest to use longhorn_backup_state offset 5m ?
b
That's what I'd try, but I wouldn't have guessed that you could substitute ints for the list. If you could, then Error would be 3 and 4 would match the Unknown string, but I don't think it works that way. (I reserve the right to be wrong as I actually have no idea - it just doesn't make sense to me)
I think this is relevant for what you're trying to do. I think what you want might be...
count ( longhorn_backup_state offset 10m =~ "Error|Unknown" ) > 0
As I read it, it would count the number of entries that match the search (backup states from the last 10 minutes that have a state of "Error" or "Unknown") and, if there are more results than 0, fire an alert.
I think that longhorn_backup_state[10m] is shorthand for the offset, so that might work too.
could be count(longhorn_backup_state[10m]=~"Error|Unknown")>0 works too.
c
Thank you brian
But your promql query i think won't work because the state of the backup changes
I tried it
I don't have the log for the backup name xyz that changes from state 0 to state 1 to state 2 to 3, it is directly 1 or 2 or 3...
But i ended up going with something like max(max_over_time(longhorn_backup_state[24h])) by (backup, volume) == 3
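For reference, a rough sketch of how a query in that shape could be wired into an alerting rule, written as a plain Prometheus rule file. The group name, alert name, severity label and summary text are made up for illustration, and the >= 4 threshold assumes states 4 and 5 are the failure states mentioned earlier in the thread. Treat it as a starting point, not a tested rule.

```yaml
groups:
  - name: longhorn-backup            # hypothetical group name
    rules:
      - alert: LonghornBackupFailed  # hypothetical alert name
        # Highest state each (backup, volume) pair reported in the last 24h.
        # Assumes 4 and 5 are the failure states, as discussed above.
        expr: max by (backup, volume) (max_over_time(longhorn_backup_state[24h])) >= 4
        labels:
          severity: warning          # hypothetical severity
        annotations:
          summary: "Backup {{ $labels.backup }} of volume {{ $labels.volume }} reported a failed state in the last 24h"
```

Because of the [24h] window, a failed backup drops out of the result once its last failing sample is older than 24h, so a 48h old failure would not keep the alert firing.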
I might have one more question but it's more prometheus related but maybe you can help..
I added to the default service monitor the following snippet:
- metricRelabelings:
    - sourceLabels: [__tmp]
      regex: '(.*)'
      replacement: 'cloudfire-stage-cortex'
      targetLabel: k8s_cluster
      action: replace
How come the prometheus metrics get duplicated? If I leave it alone I find just one service discovery target inside prometheus, but if I add this I now have two entries for the same backup. What am I doing wrong?
b
by (backup, volume) meaning it's checking both a backup object and a volume object, so the volume foo says backup bar is messed up. Also the backup object bar says "i'm messed up", so there's two entries?
idk, just a best guess.