Hey,
I'm having an interesting issue with Prometheus and Alertmanager. I'm 99% sure
it's a config issue, but I'm having a hard time figuring it out.
We have a group of polls that use the blackbox exporter to ping some
endpoints. It pings once every 30 seconds.
The rule looks like this:
- name: blackbox.rules.icmpFailed
  rules:
  - alert: BlackboxIcmpFailed
    expr: probe_icmp_duration_seconds == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Ping to Device Failed.
And our Alertmanager config looks like this:
spec:
  route:
    groupBy: [ 'instance','severity' ]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
Now here is what I am seeing.
If we have a single ping failure, an alert message is sent to Slack, and it
then clears on the next 5 minute cycle.
I thought having "for: 5m" should mean that an alert is ONLY sent if the
condition has been seen for 5 consecutive minutes. As you can imagine,
this leads to lots of angst :D
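To make my expectation concrete, here is a rough Python sketch of how I understand the "for" clause to behave. It is only a simplified model, not Prometheus's actual implementation: the function name alert_states is made up for illustration, and the 30 second rule evaluation interval is an assumption (we have not confirmed what ours is).

```python
from datetime import timedelta


def alert_states(expr_matched, eval_interval, hold):
    """Return the alert state after each rule evaluation.

    expr_matched: one bool per evaluation, True when the alert expression
                  returned a sample for the series at that evaluation.
    eval_interval: time between rule evaluations (assumed, see above).
    hold: the duration given in the rule's "for" clause.
    """
    states = []
    active_for = None  # how long the expression has matched without a gap
    for matched in expr_matched:
        if not matched:
            # A single evaluation without a match resets the timer.
            active_for = None
            states.append("inactive")
            continue
        if active_for is None:
            active_for = timedelta(0)
        else:
            active_for += eval_interval
        states.append("firing" if active_for >= hold else "pending")
    return states


interval = timedelta(seconds=30)  # assumed evaluation interval
hold = timedelta(minutes=5)       # "for: 5m" from the rule above

# One failed ping between two good ones: pending, never firing.
single_blip = alert_states([False, True, False], interval, hold)
print(single_blip)  # ['inactive', 'pending', 'inactive']

# Eleven consecutive matches span five minutes, so the last one fires.
sustained = alert_states([True] * 11, interval, hold)
print(sustained[-1])  # firing
```

Under this model a single failed ping should only ever reach "pending", and as far as I know pending alerts are not sent to Alertmanager, which is why the Slack message surprises me.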
Any ideas?