> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep
> firing, when the scraping failed, but also when it actually goes back to an
> ok state, right?

It affects all alerts individually, and I believe it's exactly what you
want. A brief flip from "failing" to "OK" doesn't resolve the alert; it
only resolves if it has remained in the "OK" state for the keep_firing_for
duration. Therefore you won't get a fresh alert until it's been OK for at
least keep_firing_for and *then* fails again.

As you correctly surmise, an alert isn't really a boolean condition; it's a
presence/absence condition: the expr returns a vector of 0 or more alerts,
each with a unique combination of labels. "keep_firing_for" retains a
particular labelled value in the vector for a period of time even if it's
no longer being generated by the alerting "expr". Hence if it does
reappear in the expr output during that time, it's just a continuation of
the previous alert.
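
As a rough sketch of what that could look like for your disk space rule (the group name, the alert name and the 15m value are placeholders I've made up, and I've dropped your instance filter for brevity; it also assumes a Prometheus version that supports keep_firing_for):

```yaml
groups:
  - name: node_alerts                      # hypothetical group name
    rules:
      - alert: node_filesystem_usage_high  # hypothetical alert name
        expr: |
          1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs"}
            / node_filesystem_size_bytes > 0.80
        # Each labelled result stays firing for 15m after expr stops
        # returning it, whether the condition really cleared or the
        # scrape failed and the series went stale.
        keep_firing_for: 15m               # illustrative value, tune to taste
```

The trade-off is that a genuine recovery is also reported up to 15m late, so pick a value a bit longer than your typical scrape outage rather than something very generous.
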
> Similarly, when a node goes completely down (maintenance or so) and then
> up again, all alerts would then start again to fire (and even a generous
> keep_firing_for would have been exceeded)... and send new notifications.

I don't understand what you're saying here. Can you give some specific
examples?
If you have an alerting expression like "up == 0" and you take 10 machines
down, then your alerting expression will return a vector of ten zeros and
this will generate ten alerts (typically grouped into a single
notification, if you use the default Alertmanager config).

When they revert to up == 1 then they won't "start again to fire", because
they were already firing. Indeed, it's almost the opposite. Let's say you
have keep_firing_for: 10m, then if any machine goes down in the 10 minutes
after the end of maintenance then it *won't* generate a new alert, because
it will just be a continuation of the old one.
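
To make that concrete, here is a minimal sketch of such a rule (group and alert names are made up for illustration, and the times in the comments are approximate since they depend on your evaluation interval):

```yaml
groups:
  - name: availability          # hypothetical group name
    rules:
      - alert: target_down      # hypothetical alert name
        expr: up == 0
        keep_firing_for: 10m
        # Example timeline for one instance:
        #   12:00  maintenance ends, up == 1, expr no longer returns it
        #          -> alert is kept firing until about 12:10
        #   12:05  instance fails again, expr returns it again
        #          -> same alert continues, no new alert is created
        #   12:10+ if it stayed up the whole time, the alert resolves;
        #          only a failure after that point starts a new alert
```
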
However, when you're doing maintenance, you might also be using silences to
prevent notifications. In that case you might want your silence to extend
10 minutes past the end of the maintenance period.
On Saturday 6 April 2024 at 04:03:07 UTC+1 Christoph Anton Mitterer wrote:
> Hey.
>
> I have some simple alerts like:
> - alert: node_upgrades_non-security_apt
>   expr: 'sum by (instance,job) (
>     apt_upgrades_pending{origin!~"(?i)^.*-security(?:\\PL.*)?$"} )'
> - alert: node_upgrades_security_apt
>   expr: 'sum by (instance,job) (
>     apt_upgrades_pending{origin=~"(?i)^.*-security(?:\\PL.*)?$"} )'
>
> If there's no upgrades, these give no value.
> Similarly, for all other simple alerts, like free disk space:
> 1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs",
> instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} /
> node_filesystem_size_bytes > 0.80
>
> No value => all ok, some value => alert.
>
> I do have some instances which are pretty unstable (i.e. scraping fails
> every now and then - or more often than that), which are however mostly
> out of my control, so I cannot do anything about that.
>
> When the target goes down, the alert clears and as soon as it's back, it
> pops up again, sending a fresh alert notification.
>
> Now I've seen:
> https://github.com/prometheus/prometheus/pull/11827
> which describes keep_firing_for as "the minimum amount of time that an
> alert should remain firing, after the expression does not return any
> results", respectively in
> <https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule>:
>
> # How long an alert will continue firing after the condition that triggered it
> # has cleared.
> [ keep_firing_for: <duration> | default = 0s ]
>
> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep
> firing, when the scraping failed, but also when it actually goes back to an
> ok state, right?
> That's IMO however rather undesirable.
>
> Similarly, when a node goes completely down (maintenance or so) and then
> up again, all alerts would then start again to fire (and even a generous
> keep_firing_for would have been exceeded)... and send new notifications.
>
>
> Is there any way to solve this? Especially that one doesn't get new
> notifications sent, when the alert never really stopped?
>
> At least I wouldn't understand how keep_firing_for would do this.
>
> Thanks,
> Chris.
>