This is incredibly helpful, thanks for taking the time to write it. I don't think there is anything like this level of description of how expr works in the docs, but I may have missed it.
You also correctly anticipated that the missing-time-series scenario was an issue for me in this work, so thanks for that too.

cam

On Fri, 13 Dec 2024 at 12:00, 'Brian Candler' via Prometheus Users <[email protected]> wrote:

>
> I do not really understand how expr works in prom rules - is it
> something that simply evaluates to either 1 or 'true' as a go bool type?
>
> No. It's not boolean logic at all.
>
> PromQL works with *vectors*: a vector contains zero or more values, each
> with a distinct set of labels. An alert fires whenever the vector is
> non-empty, regardless of the value. That is, a value of 0 triggers an alert
> just as much as a value of 1000. It's the presence or absence of a value
> which controls alerting.
>
> Take, for example, the promql query "foo". It might return the following,
> all current values of metric foo:
>
> foo{instance="aaa"} 7
> foo{instance="bbb"} 3
> foo{instance="ccc"} 1
>
> That's a vector with three values.
>
> Now take the promql query "foo > 2". It returns a vector with 2 values:
>
> foo{instance="aaa"} 7
> foo{instance="bbb"} 3
>
> If you use "foo > 2" as an alerting expression, then you'll have two
> alerts firing. If the value of foo{instance="bbb"} drops to 2 or less,
> then the alerting expression returns an instant vector with only one value,
> so the bbb alert resolves, but the aaa alert continues.
>
> This is the reason why "resolved" messages show the most recent value
> which triggered the alert, not the current (non-alerting) value. The
> current value is below the threshold, so is filtered out entirely from the
> PromQL results.
>
> Now, an expression like count({__name__=~"tcpsocket(.+)Inbound"}) also
> gives a vector as its result. If there are no timeseries inside the
> parentheses, then it is the empty vector. If there are one or more
> timeseries, then you get a single-element vector containing a single value
> (which is the count of timeseries) and an empty label set. You can try
> this for yourself in the PromQL query browser:
>
> count({__name__=~"blah_nonexistent(.*)"}) # empty result
> count({__name__=~"node_filesystem(.*)"}) # {} 1234 where {} means "empty label set"
>
> Now, when you do a binary operation between two vector values, by default
> the result vector has one entry for every label set which matches exactly
> between the LHS and RHS vectors. Any label set on the LHS which is not
> matched on the RHS, or vice versa, is discarded and gives no value in the
> result vector. But in this case, since the LHS and RHS will (almost)
> always have a single entry with empty label set, it will match.
>
> Therefore, what I think you want is simply:
>
> expr: count({__name__=~"tcpsocket(.+)Inbound"}) offset 30s !=
> count({__name__=~"tcpsocket(.+)Inbound"})
>
> That should do what you want *unless* __name__=~"tcpsocket(.+)Inbound"
> matches no timeseries at all, in which case the vector will be empty (on
> either the LHS or the RHS) and therefore the count() will be empty, and
> there's nothing to match to the other side. If this is an important case
> for you then you can fake up a vector with empty labels:
>
> expr: count({__name__=~"tcpsocket(.+)Inbound"}) offset 30s !=
> count({__name__=~"tcpsocket(.+)Inbound"})
> or vector(0)
>
> Again, PromQL's "or" operator doesn't behave like boolean expression. What
> "or" does is to match the vectors on the LHS and the RHS:
> - for any value on the LHS, use the value and label set from the LHS in
> the result (whether or not it matches something in the RHS)
> - for any value on the RHS, whose label set does not exist in the LHS,
> then add it to the result.
>
> vector(0) is a static value: an instant vector containing one element
> whose label set is empty with value 0. So if the previous expression
> doesn't contain an element with empty label set, "... or vector(0)" will
> add it to the result, and that will trigger the alert (with value 0).
>
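A sketch of the suggestion above written out as a complete rule file, under two assumptions that are not stated in the thread. First, the offset modifier is placed directly after the selector inside count(), as in the original rule, because PromQL requires offset to follow the selector immediately. Second, the "or vector(0)" fallback is wrapped around each count() separately: comparison operators bind more tightly than "or", so an unparenthesised "... != ... or vector(0)" would return the 0 element whenever the comparison result is empty, which is the normal, unchanged case. The group name is hypothetical.

```yaml
groups:
  - name: socket-count-example   # hypothetical group name
    rules:
      - alert: outboundSocketCountChange
        # Each side falls back to 0 when no series match, so both sides
        # always contain one element with an empty label set to compare.
        # "!=" without "bool" filters: the result is empty when the two
        # counts are equal, and non-empty (alert fires) when they differ.
        expr: |
          (count({__name__=~"tcpsocket(.+)Inbound"} offset 30s) or vector(0))
            !=
          (count({__name__=~"tcpsocket(.+)Inbound"}) or vector(0))
        labels:
          severity: critical
        annotations:
          summary: OB socket count has changed
```

A file like this can be syntax-checked with "promtool check rules" before it is loaded.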
> On Friday, 13 December 2024 at 09:39:02 UTC cam wrote:
>
>> This took about a week to appear on the list? Meantime, I have come up
>> with the following..
>>
>> - alert: outboundSocketCountChange
>>   expr: ((count({__name__=~"tcpsocket(.+)Inbound"} offset 30s) -
>>     count({__name__=~"tcpsocket(.+)Inbound"})) != bool 0) == 1
>>   labels:
>>     severity: critical
>>   annotations:
>>     summary: OB socket count has changed
>>
>> This does what I need but it makes me think I do not really understand
>> how expr works in prom rules - is it something that simply evaluates to
>> either 1 or 'true' as a go bool type?
>>
>> c
>>
>> On Friday, 13 December 2024 at 08:49:33 UTC cam wrote:
>>
>>> Hello all,
>>>
>>> I have a rule which is trying to count time series that match a certain
>>> regexp and spot when this changes, to raise an alert more or less
>>> immediately (i.e. no for clause). This is counting a custom socket count
>>> metric that we need to catch any changes in.
>>>
>>> - alert: outboundSocketCountChange
>>>   expr: (count({__name__=~"tcpsocket(.+)Inbound"} offset 30s) -
>>>     count({__name__=~"tcpsocket(.+)Inbound"})) != bool 0
>>>   labels:
>>>     severity: critical
>>>   annotations:
>>>     summary: OB socket count has changed
>>>
>>> It triggers fine when the value changes but it appears to then be stuck
>>> in firing, rather than resolving when the next evaluation window completes.
>>> Graphing the promQL shows exactly what I would expect - a single spike to 1
>>> when the value changes and then back to zero. I would expect the alert to
>>> clear when it hits that zero.
>>>
>>> Scrape and evaluation intervals are both set to 15s. Prom v2.45.
>>>
>>> Am I missing something here?
>>>
>> --
> You received this message because you are subscribed to a topic in the
> Google Groups "Prometheus Users" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/prometheus-users/AfVOhJ5rfOg/unsubscribe
> To unsubscribe from this group and all its topics, send an email to
> [email protected].
> To view this discussion visit
> https://groups.google.com/d/msgid/prometheus-users/77fed316-4283-4fc3-98d9-99bcf630e37bn%40googlegroups.com

