> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep 
firing, when the scraping failed, but also when it actually goes back to an 
ok state, right?

It affects all alerts individually, and I believe it's exactly what you 
want. A brief flip from "failing" to "OK" doesn't resolve the alert; it 
only resolves if it has remained in the "OK" state for the keep_firing_for 
duration. Therefore you won't get a fresh alert until it's been OK for at 
least keep_firing_for and *then* fails again.
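For concreteness, a rule using it might look like this (illustrative alert name and expr, not taken from your config):

```yaml
groups:
  - name: example
    rules:
      - alert: InstanceDown            # hypothetical name
        expr: up == 0
        for: 5m                        # must be failing for 5m before it fires
        keep_firing_for: 10m           # and stays firing for 10m after the
                                       # expr stops returning this labelset
```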

As you correctly surmise, an alert isn't really a boolean condition; it's a 
presence/absence condition: the expr returns a vector of 0 or more alerts, 
each with a unique combination of labels. "keep_firing_for" retains a 
particular labelled value in the vector for a period of time even if it's 
no longer being generated by the alerting "expr". Hence if it does 
reappear in the expr output during that time, it's just a continuation of 
the previous alert.
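To illustrate with made-up instances: if the expr is "up == 0", its output at one evaluation might be

```
up{instance="node1:9100", job="node"} => 0
up{instance="node2:9100", job="node"} => 0
```

Each labelset is its own alert. If the node1 series drops out of the output for a couple of evaluations and then comes back within keep_firing_for, Prometheus treats it as the same alert throughout.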

> Similarly, when a node goes completely down (maintenance or so) and then 
up again, all alerts would then start again to fire (and even a generous 
keep_firing_for would have been exceeded)... and send new notifications.

I don't understand what you're saying here. Can you give some specific 
examples?

If you have an alerting expression like "up == 0" and you take ten machines 
down, then your alerting expression will return a vector of ten zeros, and 
this will generate ten alerts (typically grouped into a single 
notification, if you use the default Alertmanager config).
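That grouping comes from the Alertmanager route; the default example config looks roughly like this (the receiver name here is a placeholder):

```yaml
route:
  receiver: ops-email        # placeholder receiver
  group_by: ['alertname']    # all alerts with the same alertname share one
                             # group, so ten firing alerts become one notification
  group_wait: 30s
  group_interval: 5m
```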

When they revert to up == 1, they won't "start again to fire", because 
they were already firing. Indeed, it's almost the opposite: say you have 
keep_firing_for: 10m; then if any machine goes down again in the 10 minutes 
after the end of maintenance, it *won't* generate a new alert, because 
it will just be a continuation of the old one.

However, when you're doing maintenance, you might also be using silences to 
prevent notifications. In that case you might want your silence to extend 
10 minutes past the end of the maintenance period.
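If you manage silences with amtool, you can simply make the silence longer than the maintenance window itself; something along these lines (the URL and matcher are placeholders):

```
amtool silence add --alertmanager.url=http://localhost:9093 \
  --duration=2h10m \
  --comment="maintenance + keep_firing_for margin" \
  instance=flaky-node.example.org
```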

On Saturday 6 April 2024 at 04:03:07 UTC+1 Christoph Anton Mitterer wrote:

> Hey.
>
> I have some simple alerts like:
>     - alert: node_upgrades_non-security_apt
>       expr:  'sum by (instance,job) ( 
> apt_upgrades_pending{origin!~"(?i)^.*-security(?:\\PL.*)?$"} )'
>     - alert: node_upgrades_security_apt
>       expr:  'sum by (instance,job) ( 
> apt_upgrades_pending{origin=~"(?i)^.*-security(?:\\PL.*)?$"} )'
>
> If there's no upgrades, these give no value.
> Similarly, for all other simple alerts, like free disk space:
> 1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs", 
> instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} / 
> node_filesystem_size_bytes  >  0.80
>
> No value => all ok, some value => alert.
>
> I do have some instances which are pretty unstable (i.e. scraping fails 
> every now and then - or more often than that), which are however mostly 
> out of my control, so I cannot do anything about that.
>
> When the target goes down, the alert clears and as soon as it's back, it 
> pops up again, sending a fresh alert notification.
>
> Now I've seen:
> https://github.com/prometheus/prometheus/pull/11827
> which describes keep_firing_for as "the minimum amount of time that an 
> alert should remain firing, after the expression does not return any 
> results", respectively in 
> https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule :
> # How long an alert will continue firing after the condition that 
> # triggered it has cleared. 
> [ keep_firing_for: <duration> | default = 0s ] 
>
> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep 
> firing, when the scraping failed, but also when it actually goes back to an 
> ok state, right?
> That's IMO however rather undesirable.
>
> Similarly, when a node goes completely down (maintenance or so) and then 
> up again, all alerts would then start again to fire (and even a generous 
> keep_firing_for would have been exceeded)... and send new notifications.
>
>
> Is there any way to solve this? Especially that one doesn't get new 
> notifications sent, when the alert never really stopped?
>
> At least I wouldn't understand how keep_firing_for would do this.
>
> Thanks,
> Chris.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/fa157174-2d90-45f0-9084-dc28e52e88dan%40googlegroups.com.