On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:
> Assume the following (arguably a bit made-up) example: one has a metric
> that counts the number of failed drives in a RAID. One drive fails, so an
> alert starts firing. Eventually the computing centre replaces the drive
> and it starts rebuilding (it presumably doesn't matter whether the
> rebuilding is still considered to cause an alert or not). Eventually the
> rebuild finishes and the alert should go away (and I should e.g. get a
> resolved message). But because of keep_firing_for, it doesn't stop
> straight away. Now, before it does, yet another disk fails. But for
> Prometheus, with keep_firing_for, it will look like the same alert.
If the alerts have the exact same set of labels (e.g. the alert is at the level of the RAID controller, not at the level of individual drives) then yes: it failed, it was fixed, it failed again within keep_firing_for, and you only get a single alert with no additional notification. But that's not the problem you originally asked about: "When the target goes down, the alert clears and as soon as it's back, it pops up again, sending a fresh alert notification."

keep_firing_for can be set differently for different alerts. So you can set it to 10m for the "up == 0" alert, and not set it at all for the RAID alert, if that's what you want.

> Also, depending on how large I have to set keep_firing_for, I will also
> get resolved messages later... which, depending on what one does with the
> alerts, may also be less desirable.

Surely that delay is essential for the de-flapping scenario you describe: you can't send the "alert resolved" message until you are *sure* the alert has resolved (i.e. after keep_firing_for has expired). Conversely, if you sent the "alert resolved" message immediately (before keep_firing_for had expired) and the problem recurred, then you'd have to send out a new "alert firing" message - which is exactly the flap noise I think you are asking to suppress.

In any case, sending out resolved messages is arguably a bad idea:
https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

I turned them off, and:
(a) it immediately reduced notifications by 50%
(b) it encourages alerts to be properly investigated (or properly tuned)

That is: if something was important enough to alert on in the first place, then it's important enough to investigate thoroughly, even if the threshold has since crossed back to normal. And if it wasn't important enough to alert on, then the alerting rule needs adjusting to make it less noisy.
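As a concrete illustration of setting keep_firing_for per alert, here is a minimal rules-file sketch. The alert names are made up, and the RAID expression assumes node_exporter's node_md_disks metric is available (and keep_firing_for requires a reasonably recent Prometheus, 2.42 or later):

```yaml
groups:
  - name: example
    rules:
      # Connectivity alert: de-flap it by keeping it firing for 10m
      # after "up" recovers, so a brief bounce doesn't re-notify.
      - alert: TargetDown
        expr: up == 0
        for: 5m
        keep_firing_for: 10m

      # RAID alert: no keep_firing_for, so a drive that is replaced
      # and then fails again raises a fresh notification immediately.
      - alert: RaidDegraded
        expr: node_md_disks{state="failed"} > 0
```

This is only a sketch of the shape of the configuration, not a recommendation for the specific durations.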
This is expanded upon in this document:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

> I think the main problem behind this may rather be a conceptual one,
> namely that Prometheus uses "no data" to mean "no alert", which also
> happens when there is no data because of e.g. scrape failures, so it
> can't really differentiate between the two conditions.

I think it can. Scrape failures can be detected explicitly with up == 0; alert on those separately. The odd occasional missed scrape doesn't affect most other queries because of the lookback-delta: instant vector queries will look up to 5 minutes into the past. As long as you're scraping every 2 minutes, you can always survive a single failed scrape without noticing it.

If your device goes away for longer than 5 minutes, then sure, the alerting data will no longer be there - but then you have no idea whether or not the condition you were alerting on exists (since you have no visibility of the target's state). What you do have instead is a "scrape failed" condition, which, as I said already, is easy to alert on.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/6e6de7dd-b156-475f-b76d-6f758f2c3189n%40googlegroups.com.
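[Worked example for the lookback arithmetic above.] With the default 5-minute lookback-delta, the number of consecutive failed scrapes a series can survive before instant-vector queries stop seeing it follows from simple division; this little sketch (the function name is mine, not a Prometheus API) makes the claim about 2-minute scraping checkable:

```python
# With scrape interval i and lookback-delta L, the last good sample
# remains visible to instant-vector queries for roughly L seconds, so
# up to floor(L / i) - 1 consecutive scrapes can fail before a query
# can find no sample in its lookback window.

def tolerated_failures(scrape_interval_s: int, lookback_s: int = 300) -> int:
    """Consecutive failed scrapes survivable before queries see no data."""
    return max(lookback_s // scrape_interval_s - 1, 0)

print(tolerated_failures(120))  # 2-minute scrapes -> 1 failure tolerated
print(tolerated_failures(60))   # 1-minute scrapes -> 4 failures tolerated
```

This matches the statement above: scraping every 2 minutes survives exactly one missed scrape; anything longer than the lookback window and the series goes stale, at which point you are back to alerting on up == 0 instead.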

