On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:
> Assume the following (arguably a bit made-up) example: one has a metric
> that counts the number of failed drives in a RAID. One drive fails, so an
> alert starts firing. Eventually the computing centre replaces the drive
> and it starts rebuilding (it presumably doesn't matter whether the
> rebuilding is still considered to cause an alert or not). Eventually the
> rebuild finishes and the alert should go away (and I should e.g. get a
> resolved message). But because of keep_firing_for, it doesn't stop
> straight away. Now, before it does, yet another disk fails. But for
> Prometheus, with keep_firing_for, it will look like the same alert.
If the alerts have the exact same set of labels (e.g. the alert is at the level of the RAID controller, not at the level of individual drives) then yes: it failed, it was fixed, it failed again within keep_firing_for, and you only get a single alert with no additional notification. But that's not the problem you originally asked about: "When the target goes down, the alert clears and as soon as it's back, it pops up again, sending a fresh alert notification."

keep_firing_for can be set differently for different alerts. So you can set it to 10m for the "up == 0" alert, and not set it at all for the RAID alert, if that's what you want.

> Also, depending on how large I have to set keep_firing_for, I will also
> get resolved messages later... which, depending on what one does with the
> alerts, may also be less desirable.

Surely that delay is essential for the de-flapping scenario you describe: you can't send the "alert resolved" message until you are *sure* the alert has resolved (i.e. after keep_firing_for has expired). Conversely, if you sent the "alert resolved" message immediately (before keep_firing_for had expired) and the problem recurred, then you'd have to send out a new "alert firing" message - which is exactly the flap noise I think you are asking to suppress.

In any case, sending out resolved messages is arguably a bad idea:
https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

I turned them off, and:
(a) it immediately reduced notifications by 50%
(b) it encourages alerts to be properly investigated (or properly tuned)

That is: if something was important enough to alert on in the first place, then it's important enough to investigate thoroughly, even if the threshold has since crossed back to normal. And if it wasn't important enough to alert on, then the alerting rule needs adjusting to make it less noisy.
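As a concrete illustration of setting keep_firing_for per alert, here is a minimal rules-file sketch. The alert names are made up, and the RAID expression assumes node_exporter's node_md_disks metric is available (and keep_firing_for requires a reasonably recent Prometheus, 2.42 or later):

```yaml
groups:
  - name: example
    rules:
      # Connectivity alert: de-flap it by keeping it firing for 10m
      # after "up" recovers, so a brief bounce doesn't re-notify.
      - alert: TargetDown
        expr: up == 0
        for: 5m
        keep_firing_for: 10m

      # RAID alert: no keep_firing_for, so a drive that is replaced
      # and then fails again raises a fresh notification immediately.
      - alert: RaidDegraded
        expr: node_md_disks{state="failed"} > 0
```

This is only a sketch of the shape of the configuration, not a recommendation for the specific durations.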
This is expanded upon in this document:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

> I think the main problem behind this may rather be a conceptual one,
> namely that Prometheus uses "no data" to mean "no alert", which also
> happens when there is no data because of e.g. scrape failures, so it
> can't really differentiate between the two conditions.

I think it can. Scrape failures can be detected explicitly with up == 0; alert on those separately. The odd occasional missed scrape doesn't affect most other queries because of the lookback-delta: instant vector queries will look up to 5 minutes into the past. As long as you're scraping every 2 minutes, you can always survive a single failed scrape without noticing it.

If your device goes away for longer than 5 minutes, then sure, the alerting data will no longer be there - but then you have no idea whether or not the condition you were alerting on exists (since you have no visibility of the target's state). What you do have instead is a "scrape failed" condition, which, as I said already, is easy to alert on.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/6e6de7dd-b156-475f-b76d-6f758f2c3189n%40googlegroups.com.
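[Worked example for the lookback arithmetic above.] With the default 5-minute lookback-delta, the number of consecutive failed scrapes a series can survive before instant-vector queries stop seeing it follows from simple division; this little sketch (the function name is mine, not a Prometheus API) makes the claim about 2-minute scraping checkable:

```python
# With scrape interval i and lookback-delta L, the last good sample
# remains visible to instant-vector queries for roughly L seconds, so
# up to floor(L / i) - 1 consecutive scrapes can fail before a query
# can find no sample in its lookback window.

def tolerated_failures(scrape_interval_s: int, lookback_s: int = 300) -> int:
    """Consecutive failed scrapes survivable before queries see no data."""
    return max(lookback_s // scrape_interval_s - 1, 0)

print(tolerated_failures(120))  # 2-minute scrapes -> 1 failure tolerated
print(tolerated_failures(60))   # 1-minute scrapes -> 4 failures tolerated
```

This matches the statement above: scraping every 2 minutes survives exactly one missed scrape; anything longer than the lookback window and the series goes stale, at which point you are back to alerting on up == 0 instead.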

