[prometheus-users] Re: Alertmanager frequently sending erroneous resolve notifications

'Brian Candler' via Prometheus Users Sat, 18 May 2024 13:54:13 -0700

> What can be done?

Perhaps the alert condition resolved very briefly. With modern versions of 
Prometheus (v2.42.0 
<https://github.com/prometheus/prometheus/releases/v2.42.0> or later), the 
solution is to add keep_firing_for to the rule:

for: 2d
keep_firing_for: 10m

The alert won't be resolved unless its condition has been *continuously* 
absent for 10 minutes. (Of course, this means your "resolved" notifications 
will be delayed by 10 minutes, but that's basically the whole point: don't 
send them until you're sure the alert isn't going to retrigger.)
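
Applied to the rule from the original question (quoted further down), this 
might look like the sketch below. The group name is a placeholder I made up, 
and the annotations are left out for brevity; only keep_firing_for is new.

```yaml
groups:
  - name: backup-freshness        # hypothetical group name
    rules:
      - alert: Restic Prune Freshness
        expr: restic_prune_status{uptodate!="1"} and restic_prune_status{alerts!="0"}
        for: 2d                   # condition must hold for 2 days before the alert fires
        keep_firing_for: 10m      # stay firing until the condition has been absent for 10 minutes
        labels:
          topic: backup
          freshness: outdated
```

If the gaps in your metric can be longer than 10 minutes (for example, if the 
textfile is rewritten infrequently), a larger keep_firing_for value may be 
needed.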

The other alternative is simply to turn off resolved notifications 
entirely. This approach sounds odd but has a lot to recommend it:
https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
https://blog.cloudflare.com/alerts-observability

The point is that if a problem occurred which was serious enough to alert 
on, then it requires investigation before the case can be "closed": either 
there's an underlying problem, or if it was a false positive then the alert 
condition needs tuning. Sending a resolved message encourages laziness 
("oh, it fixed itself, no further work required"). Also, turning off 
resolved messages immediately cuts your notification volume roughly in half.
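
Resolved notifications are controlled per receiver in alertmanager.yml via 
send_resolved. A minimal sketch, assuming a webhook and an email receiver (I 
don't know which receiver types you actually use; the receiver name, URL and 
address are placeholders, and the email part also needs your usual SMTP 
settings):

```yaml
receivers:
  - name: backup-team                          # hypothetical receiver name
    webhook_configs:
      - url: https://alerts.example.com/hook   # placeholder URL
        send_resolved: false                   # webhook receivers default to true
    email_configs:
      - to: backup-team@example.com            # placeholder address
        send_resolved: false                   # already the default for email; shown for clarity
```

The default for send_resolved differs between receiver types, so it's worth 
setting it explicitly on each one.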

On Saturday 18 May 2024 at 19:50:32 UTC+1 Sarah Dundras wrote:

> Hi, this problem is driving me mad: 
>
> I am monitoring backups that log their backup results to a textfile. It is 
> being picked up and all is well, and the alerts are OK too, BUT! Alertmanager 
> frequently sends out odd "resolved" notifications although the firing 
> status never changed! 
>
> Here's such an alert rule that does this: 
>
> - alert: Restic Prune Freshness
>   expr: restic_prune_status{uptodate!="1"} and restic_prune_status{alerts!="0"}
>   for: 2d
>   labels:
>     topic: backup
>     freshness: outdated
>     job: "{{ $labels.restic_backup }}"
>     server: "{{ $labels.server }}"
>     product: veeam
>   annotations:
>     description: "Restic Prune for '{{ $labels.backup_name }}' on host '{{ $labels.server_name }}' is not up-to-date (too old)"
>     host_url: "https://backups.example.com/d/3be21566-3d15-4238-a4c5-508b059dccec/restic?orgId=2&var-server_name={{ $labels.server_name }}&var-result=0&var-backup_name=All"
>     service_url: "https://backups.example.com/d/3be21566-3d15-4238-a4c5-508b059dccec/restic?orgId=2&var-server_name=All&var-result=0&var-backup_name={{ $labels.backup_name }}"
>     service: "{{ $labels.job_name }}"
>
> What can be done? 
>

