[prometheus-users] Re: Offset alert never clearing

'Brian Candler' via Prometheus Users Fri, 13 Dec 2024 04:00:50 -0800

> I do not really understand how expr works in prom rules - is it something 
> that simply evaluates to either 1 or 'true' as a go bool type?

No. It's not boolean logic at all.

PromQL works with *vectors*: a vector contains zero or more values, each 
with a distinct set of labels. An alert fires whenever the vector is 
non-empty, regardless of the value. That is, a value of 0 triggers an alert 
just as much as a value of 1000. It's the presence or absence of a value 
which controls alerting.
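
If it helps, here is a rough way to picture that rule. This is only a toy Python model, not Prometheus code: the dict-of-label-sets representation and the name firing_alerts are my own assumptions for illustration.

```python
# Toy model: an instant vector is a dict mapping a label set to a sample value.
# Names here (Vector, firing_alerts) are illustrative, not Prometheus APIs.
from typing import Dict, FrozenSet, List, Tuple

Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]


def firing_alerts(result: Vector) -> List[Labels]:
    """One alert per element of the result; the sample value is never inspected."""
    return list(result.keys())


no_labels: Labels = frozenset()

# A "bool" comparison yields 0 or 1 for each input element instead of
# filtering it out, so the element is present either way.
changed = {no_labels: 1.0}
unchanged = {no_labels: 0.0}
empty: Vector = {}

print(len(firing_alerts(changed)))    # 1
print(len(firing_alerts(unchanged)))  # 1: a value of 0 still fires
print(len(firing_alerts(empty)))      # 0: only an empty vector resolves
```

The point of the sketch is that nothing in firing_alerts looks at the value, which is why an expression that always returns an element keeps the alert firing.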

Take, for example, the PromQL query "foo". It might return the following, 
the current values of all timeseries of metric foo:

foo{instance="aaa"} 7
foo{instance="bbb"} 3
foo{instance="ccc"} 1

That's a vector with three values.

Now take the PromQL query "foo > 2". It returns a vector with two values:

foo{instance="aaa"} 7
foo{instance="bbb"} 3

If you use "foo > 2" as an alerting expression, then you'll have two alerts 
firing.  If the value of foo{instance="bbb"} drops to 2 or less, then the 
alerting expression returns an instant vector with only one value, so the 
bbb alert resolves, but the aaa alert continues.
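
The filtering behaviour of "foo > 2" can be sketched in a few lines of Python. Again this is a toy model under my own assumptions (the vector is keyed by the instance label only, and greater_than is a made-up helper), not how Prometheus implements it.

```python
# Toy model of "foo > 2" used as an alerting expression (not Prometheus code).
from typing import Dict

# Instant vector keyed by the "instance" label value only, for brevity.
Vector = Dict[str, float]


def greater_than(vec: Vector, threshold: float) -> Vector:
    """Filter comparison: keep elements whose value exceeds the threshold."""
    return {inst: v for inst, v in vec.items() if v > threshold}


foo = {"aaa": 7, "bbb": 3, "ccc": 1}
print(sorted(greater_than(foo, 2)))   # ['aaa', 'bbb']

foo["bbb"] = 2                        # bbb drops to the threshold
print(sorted(greater_than(foo, 2)))   # ['aaa']
```

Each key left in the result corresponds to one firing alert; a key that disappears corresponds to an alert that resolves.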

This is the reason why "resolved" messages show the most recent value which 
triggered the alert, not the current (non-alerting) value. The current 
value is below the threshold, so is filtered out entirely from the PromQL 
results.

Now, an expression like count({__name__=~"tcpsocket(.+)Inbound"}) also 
gives a vector as its result. If there are no timeseries inside the 
parentheses, then it is the empty vector. If there are one or more 
timeseries, then you get a single-element vector containing a single value 
(which is the count of timeseries) and an empty label set.  You can try 
this for yourself in the PromQL query browser:

count({__name__=~"blah_nonexistent(.*)"})   # empty result
count({__name__=~"node_filesystem(.*)"})    # {} 1234, where {} means "empty label set"

Now, when you do a binary operation between two vector values, by default 
the result vector has one entry for every label set which matches exactly 
between the LHS and RHS vectors. Any label set on the LHS which is not 
matched on the RHS, or vice versa, is discarded and gives no value in the 
result vector.  But in this case, since the LHS and RHS will (almost) 
always have a single entry with empty label set, it will match.
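
To make the count() and matching behaviour concrete, here is a toy Python model. It is a sketch under assumptions: the helper names and the tcpsocketAInbound-style metric names are invented for illustration, and it only handles exact label-set matching as described above.

```python
# Toy model of count() and default vector matching (not Prometheus code).
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]
NO_LABELS: Labels = frozenset()


def count(vec: Vector) -> Vector:
    """count() without by(): empty in, empty out; otherwise one element
    with an empty label set."""
    return {NO_LABELS: float(len(vec))} if vec else {}


def not_equal(lhs: Vector, rhs: Vector) -> Vector:
    """lhs != rhs: keep an LHS element only when the identical label set
    exists on the RHS and the two values differ."""
    return {
        labels: value
        for labels, value in lhs.items()
        if labels in rhs and value != rhs[labels]
    }


def series(*names: str) -> Vector:
    """Hypothetical timeseries, one per metric name, all with value 1."""
    return {frozenset({("__name__", n)}): 1.0 for n in names}


two = count(series("tcpsocketAInbound", "tcpsocketBInbound"))
three = count(series("tcpsocketAInbound", "tcpsocketBInbound", "tcpsocketCInbound"))
none = count(series())

print(not_equal(two, three))  # {frozenset(): 2.0}  counts differ, alert fires
print(not_equal(two, two))    # {}  counts equal, alert resolves
print(not_equal(two, none))   # {}  RHS is empty, nothing to match
```

The last line is the awkward case: the number of series went from two to none, but because count() of nothing is the empty vector, the comparison has nothing to match and returns nothing.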

Therefore, what I think you want is simply:

expr: count({__name__=~"tcpsocket(.+)Inbound"} offset 30s) != 
count({__name__=~"tcpsocket(.+)Inbound"})

That should do what you want *unless* __name__=~"tcpsocket(.+)Inbound" 
matches no timeseries at all, in which case the vector will be empty (on 
either the LHS or the RHS) and therefore the count() will be empty, and 
there's nothing to match to the other side.  If this is an important case 
for you then you can fake up a vector with empty labels:

expr: (count({__name__=~"tcpsocket(.+)Inbound"} offset 30s) or vector(0)) != 
(count({__name__=~"tcpsocket(.+)Inbound"}) or vector(0))

Again, PromQL's "or" operator doesn't behave like a boolean expression. What 
"or" does is to match the vectors on the LHS and the RHS:
- for any value on the LHS, use the value and label set from the LHS in the 
result (whether or not it matches something in the RHS)
- for any value on the RHS whose label set does not exist in the LHS, add it 
to the result.

vector(0) is a static value: an instant vector containing one element whose 
label set is empty with value 0.  So if a count() returns the empty vector, 
"... or vector(0)" substitutes an element with empty label set and value 0, 
and both sides of the comparison always have something to match. Note that 
the "or vector(0)" has to be applied to each side separately: "!=" binds more 
tightly than "or", so appended to the whole comparison it would add {} 0 
whenever the counts are equal, and the alert would never clear.
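
Here is a toy Python model of "or" as described above, with the vector(0) fallback applied to each count() before the comparison so that a missing side is treated as 0. The function names are my own and this is only a sketch of the semantics, not Prometheus code.

```python
# Toy model of PromQL "or" and vector(0) (illustrative names, not Prometheus code).
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]
NO_LABELS: Labels = frozenset()


def vector(value: float) -> Vector:
    """Single element with an empty label set."""
    return {NO_LABELS: float(value)}


def or_(lhs: Vector, rhs: Vector) -> Vector:
    """Everything from the LHS, plus RHS elements whose label set is not on the LHS."""
    result = dict(lhs)
    for labels, value in rhs.items():
        result.setdefault(labels, value)
    return result


def not_equal(lhs: Vector, rhs: Vector) -> Vector:
    """Filter comparison with exact label-set matching."""
    return {k: v for k, v in lhs.items() if k in rhs and v != rhs[k]}


def count_changed(before: Vector, now: Vector) -> Vector:
    """(before or vector(0)) != (now or vector(0))"""
    return not_equal(or_(before, vector(0)), or_(now, vector(0)))


three = {NO_LABELS: 3.0}            # stands in for a count() result of 3
print(count_changed(three, three))  # {}  unchanged, no alert
print(count_changed(three, {}))     # {frozenset(): 3.0}  all series vanished
print(count_changed({}, three))     # {frozenset(): 0.0}  fires with value 0
print(count_changed({}, {}))        # {}  nothing before, nothing now
```

The third case shows an alert firing with value 0, which is fine: it is the presence of the element that matters, not its value.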

On Friday, 13 December 2024 at 09:39:02 UTC cam wrote:

> This took about a week to appear on the list? Meantime, I have come up 
> with the following.. 
>
>   - alert: outboundSocketCountChange
>     expr: ((count({__name__=~"tcpsocket(.+)Inbound"} offset 30s) - 
> count({__name__=~"tcpsocket(.+)Inbound"})) != bool 0) == 1
>
>     labels:
>       severity: critical
>     annotations:
>       summary: OB socket count has changed
>
> This does what I need but it makes me think I do not really understand how 
> expr works in prom rules - is it something that simply evaluates to either 
> 1 or 'true' as a go bool type?
>
> c
>
> On Friday, 13 December 2024 at 08:49:33 UTC cam wrote:
>
>> Hello all,
>>
>> I have a rule which is trying to count time series that match a certain 
>> regexp and spot when this changes, to raise an alert more or less 
>> immediately (i.e. no for clause). This is counting a custom socket count 
>> metric that we need to catch any changes in.
>>
>>   - alert: outboundSocketCountChange
>>     expr: (count({__name__=~"tcpsocket(.+)Inbound"} offset 30s) - 
>> count({__name__=~"tcpsocket(.+)Inbound"})) != bool 0
>>     labels:
>>       severity: critical
>>     annotations:
>>       summary: OB socket count has changed
>>
>> It triggers fine when the value changes but it appears to then be stuck 
>> in firing, rather than resolving when the next evaluation window completes. 
>> Graphing the promQL shows exactly what I would expect - a single spike to 1 
>> when the value changes and then back to zero. I would expect the alert to 
>> clear when it hits that zero.
>>
>> Scrape and evaluation intervals are both set to 15s. Prom v2.45.
>>
>> Am I missing something here? 
>>
>

