On Monday, April 8, 2024 at 11:05:41 PM UTC+2 Brian Candler wrote:
On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:
But for Prometheus, with keep_firing_for, it will be like the same alert.
If the alerts have the exact same set of labels (e.g. the alert is at the
level of the RAID controller, not at the level of individual drives) then
yes.
Which will still be quite often the case, I guess. Sometimes it may not
matter, i.e. when a "new" alert (which has the same label set) is "missed"
because of keep_firing_for, but sometimes it may.
It failed, it got fixed, it failed again within keep_firing_for: then you only
get a single alert, with no additional notification.
But that's not the problem you originally asked for:
"When the target goes down, the alert clears and as soon as it's back, it
pops up again, sending a fresh alert notification."
Sure, and this can be avoided with keep_firing_for, but as far as I can see
only in some cases (since one wants to keep keep_firing_for shortish), and at
the cost of losing information about when the alert condition actually went
away (which Prometheus can, in principle, know) and came back while still
firing.
keep_firing_for can be set differently for different alerts. So you can
set it to 10m for the "up == 0" alert, and not set it at all for the RAID
alert, if that's what you want.
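In rule-file terms that could look something like the following sketch (the alert names and the RAID expression are just illustrative assumptions, not taken from this thread; keep_firing_for needs Prometheus 2.42 or later):

```yaml
groups:
  - name: example
    rules:
      # De-flap the target-down alert: keep it firing for 10m after it resolves.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        keep_firing_for: 10m
      # RAID alert without keep_firing_for: resolves as soon as it clears.
      - alert: RaidDegraded
        expr: node_md_disks{state="failed"} > 0
```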
If there were no other way than the current keep_firing_for (i.e. if my idea
of an alternative keep_firing_for that considers the up/down state of the
queried metrics isn't possible and/or reasonable), then rather than being able
to set keep_firing_for per alert, I'd wish to be able to set it per queried
instance.
For some of the things I'm working on at the university, it might be a nice
approach to (automatically) query the status of an alert and take action while
it fires, but then I'd also want to stop that action rather soon after the
alert (actually) stops. If I have to use a longer keep_firing_for because of a
set of unstable nodes, then either I get the penalty of unnecessarily
long-firing alerts for all nodes, or I maintain different sets of alert rules,
which would be possible but also quite ugly.
Surely that delay is essential for the de-flapping scenario you describe:
you can't send the alert resolved message until you are *sure* the alert
has resolved (i.e. after keep_firing_for).
Conversely: if you sent the alert resolved message immediately (before
keep_firing_for had expired), and the problem recurred, then you'd have
to send out a new alert failing message - which is the flap noise I think
you are asking to suppress.
Okay, maybe we have a misunderstanding here, or, better put, I guess there are
two kinds of flapping alerts:
For example, assume an alert that monitors the utilised disk space on the
root fs, and fires whenever that's above 80%.
Type 1 Flapping:
- The scraping of the metrics works all the time (i.e. `up` is all the time
1).
- But IO is happening, that just causes the 80% to be exceeded and then
fallen below every few seconds.
Type 2 Flapping:
- There is IO, but the utilisation is always above 80%, say it's already at
~90% all the time.
- My scrapes fail every now and then. [0]
I honestly haven't even thought about type 1 yet. But I think these are the
ones which would be perfectly solved by keep_firing_for.
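For the record, a Type 1 rule for the 80% example might look like this (a sketch, assuming node_exporter filesystem metrics and Prometheus 2.42+):

```yaml
- alert: RootFsAbove80Percent
  expr: |
    100 * (1 - node_filesystem_avail_bytes{mountpoint="/"}
             / node_filesystem_size_bytes{mountpoint="/"}) > 80
  # Suppress resolve/re-fire noise while utilisation oscillates around 80%.
  keep_firing_for: 10m
```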
Well, even there I'd still like to be able to have keep_firing_for applied
only to a given label set, e.g. something like: keep_firing_for: 10m
on {alertname=~"regex-for-my-known-flapping-alerts"}
Type 2 is the one that causes me headaches right now.
That is why I thought before that it could be solved by something like
keep_firing_for that also takes into account whether any of the metrics it
queries came from a target that is "currently" down, and only then lets
keep_firing_for take effect.
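Until something like that exists, one workaround I can think of is to bridge failed scrapes in the alert expression itself, e.g. with last_over_time(), so the condition keeps evaluating from the last successful sample instead of going stale while the target is down (a sketch, with an assumed 15m bridging window):

```yaml
- alert: RootFsAbove80Percent
  expr: |
    100 * (1 - last_over_time(node_filesystem_avail_bytes{mountpoint="/"}[15m])
             / last_over_time(node_filesystem_size_bytes{mountpoint="/"}[15m])) > 80
```

While scrapes succeed, last_over_time() just returns the latest sample, so the rule behaves as before; the trade-off is that if the target disappears for good, the alert can linger for up to the window length.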
Thanks,
Chris.
[0] I do have a number of hosts where this constantly happens; not really sure
why, TBH, but even with a niceness of -20 and an IO niceness of 0 (though in
the best-effort class) it happens quite often. The node is under high load
(it's one of our compute nodes for the LHC Computing Grid)... so I guess maybe
it's just "overloaded". So I don't think this will go away, and I somehow have
to get things working with the scrapes failing every now and then.
What actually puzzled me more is this:
[image: Screenshot from 2024-04-09 00-24-59.png]
That's some of the graphs from the Node Full Exporter Grafana dashboard,
all for one node (which is one of the flapping ones).
As you can see, Memory Basic and Disc Space Used Basic have a gap, where
scraping failed.
My assumption was that, for a given target/instance, either scraping fails for
all metrics or succeeds for all.
But here, only the right side plots have gaps, the left side ones don't.
Maybe that's just some consequence of these using counters and rate() or
irate()?
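That would be my guess as well: an instant query for a gauge returns nothing once the latest sample is a staleness marker, while rate()/irate() work on a range vector and still produce a value as long as the window contains at least two raw samples, so they can paper over short scrape gaps. Roughly:

```promql
# Gauge panel: shows a gap as soon as the most recent scrape failed.
node_memory_MemAvailable_bytes

# Rate panel: keeps returning values while the 5m window still holds
# two or more samples, hiding a single missed scrape.
irate(node_cpu_seconds_total{mode="idle"}[5m])
```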
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/d8c6ff8e-f820-4ed3-a8e4-c8cbc79f40d6n%40googlegroups.com.