On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:

Assume the following (arguably a bit made up) example:
One has a metric that counts the number of failed drives in a RAID. One 
drive fails, so an alert starts firing. Eventually the computing centre 
replaces the drive and it starts rebuilding (it presumably doesn't matter 
whether the rebuild is still considered to cause an alert or not). 
Eventually the rebuild finishes and the alert should go away (and I should 
e.g. get a resolved message).
But because of keep_firing_for, it doesn't stop straight away.
Now, before it does, yet another disk fails.
But to Prometheus, with keep_firing_for, it looks like the same alert.


If the alerts have the exact same set of labels (e.g. the alert is at the 
level of the RAID controller, not at the level of individual drives) then 
yes.

It failed, it was fixed, and it failed again within keep_firing_for: in 
that case you only get a single alert, with no additional notification.

But that's not the problem you originally asked about:

"When the target goes down, the alert clears and as soon as it's back, it 
pops up again, sending a fresh alert notification."

keep_firing_for can be set differently for different alerts.  So you can 
set it to 10m for the "up == 0" alert, and not set it at all for the RAID 
alert, if that's what you want.
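
For example, the two rules might look something like this (a sketch only; the alert names and the RAID metric name are made up for illustration):

```yaml
groups:
  - name: example
    rules:
      # Flappy target: hold the alert for 10m after the condition clears,
      # so a brief recovery doesn't generate resolve/fire notification pairs.
      - alert: TargetDown
        expr: up == 0
        for: 5m
        keep_firing_for: 10m
      # RAID alert: no keep_firing_for, so it resolves as soon as the
      # failed-drive count returns to zero.
      - alert: RaidDegraded
        expr: raid_failed_drives > 0   # hypothetical metric name
        for: 5m
```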

Also, depending on how large I have to set keep_firing_for, I will get 
resolved messages later... which, depending on what one does with the 
alerts, may also be less desirable.


Surely that delay is essential for the de-flapping scenario you describe: 
you can't send the alert-resolved message until you are *sure* the alert 
has resolved (i.e. not until keep_firing_for has expired).

Conversely: if you sent the alert-resolved message immediately (before 
keep_firing_for had expired), and the problem recurred, then you'd have 
to send out a new alert-firing message - which is exactly the flap noise 
I think you are asking to suppress.

In any case, sending out resolved messages is arguably a bad idea:
https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

I turned them off, and:
(a) it immediately reduced notifications by 50%
(b) it encourages alerts to be investigated properly (or the alerting 
rules to be tuned properly)

That is: if something was important enough to alert on in the first place, 
then it's important enough to investigate thoroughly, even if the threshold 
has been crossed back to normal since then. And if it wasn't important 
enough to alert on, then the alerting rule needs adjusting to make it less 
noisy.

This is expanded upon in this document:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

I think the main problem behind this may rather be a conceptual one, 
namely that Prometheus uses "no data" to mean "no alert", which also 
happens when there is no data because of e.g. scrape failures, so it 
can't really differentiate between the two conditions.


I think it can.

Scrape failures can be explicitly detected by up == 0.  Alert on those 
separately.

The odd missed scrape doesn't affect most other queries, because of the 
lookback delta: instant vector queries will look up to 5 minutes into 
the past. As long as you're scraping every 2 minutes, you can always 
survive a single failed scrape without noticing it.
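
To illustrate (a sketch; the job name and target are made up), a 2-minute scrape interval keeps you safely inside the default 5-minute lookback delta:

```yaml
scrape_configs:
  - job_name: node            # hypothetical job
    scrape_interval: 2m       # after one missed scrape, the newest sample is
                              # at most ~4m old - still within the 5m lookback
    static_configs:
      - targets: ['node1:9100']
```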

If your device goes away for longer than 5 minutes, then sure, the 
alerting data will no longer be there - but then you have no idea whether 
or not the condition you were alerting on exists (since you have no 
visibility of the target state). Instead, you have a "scrape failed" 
condition, which, as I said already, is easy to alert on.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6e6de7dd-b156-475f-b76d-6f758f2c3189n%40googlegroups.com.
