Thank you for your prompt response and guidance on addressing the metric 
staleness issue.

Regarding metric staleness: I confirm that I have already implemented the 
suggested approach of using a range selector in square brackets in the 
recording and alerting rules (e.g. max_over_time(metric[1h])). However, the 
main challenge persists: the number of alerts generated by Prometheus does 
not match the number displayed in Alertmanager. 
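
Concretely, the rules I am using have this shape (the metric name and 
threshold below are simplified placeholders, not my real configuration):

```yaml
groups:
  - name: staleness_workaround
    rules:
      - alert: MetricHighOverLastHour
        # max_over_time keeps the expression evaluable even if the
        # series has gaps shorter than the 1h range window
        expr: max_over_time(some_metric[1h]) > 100
        for: 5m
        labels:
          severity: warning
```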

To illustrate: Prometheus may show approximately 25,000 alerts triggered 
within a given period, yet Alertmanager often displays a significantly 
lower count for the same period, such as 10,000 or 18,000, rather than the 
expected 25,000.

This inconsistency poses a significant challenge to our alert management 
process, causing confusion and risking that critical alerts are overlooked.

I would greatly appreciate any further insights or recommendations for 
resolving this issue and ensuring that the alert counts in Prometheus and 
Alertmanager align.
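
For reference, the counts I quoted come from queries along these lines 
(ALERTS is Prometheus's built-in synthetic series, and alertmanager_alerts 
is exposed on Alertmanager's own /metrics endpoint; I am assuming both are 
being scraped):

```promql
# Firing alerts as seen by Prometheus itself
count(ALERTS{alertstate="firing"})

# Active alerts as seen by Alertmanager
sum(alertmanager_alerts{state="active"})
```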
On Saturday, March 30, 2024 at 2:29:42 PM UTC+5:30 Brian Candler wrote:

> On Friday 29 March 2024 at 22:09:18 UTC Chris Siebenmann wrote:
>
> I believe that recording rules and alerting rules similarly may have 
> their evaluation time happen at different offsets within their 
> evaluation interval. This is done for the similar reason of spreading 
> out the internal load of rule evaluations across time.
>
>
> I think it's more accurate to say that *rule groups* are spread 
> over their evaluation interval, and rules within the same rule group are 
> evaluated sequentially 
> <https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules>.
>  
> This is how you can build rules that depend on each other, e.g. a recording 
> rule followed by other rules that depend on its output; put them in the 
> same rule group.
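>
> A minimal sketch of such a group (rule names and the 0.9 threshold are 
> illustrative, not from the thread):

```yaml
groups:
  - name: cpu_rules
    interval: 1m
    rules:
      # The recording rule is evaluated first within the group...
      - record: instance:cpu_busy:rate5m
        expr: sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      # ...and the alert that depends on its output runs after it,
      # because rules in the same group are evaluated sequentially.
      - alert: HighCpuUsage
        expr: instance:cpu_busy:rate5m > 0.9
        for: 10m
```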
>
> As for scraping: you *can* change this staleness interval 
> using --query.lookback-delta, but doing so is strongly discouraged. Using the 
> default of 5 mins, you should use a maximum scrape interval of 2 mins so 
> that even if you miss one scrape for a random reason, you still have two 
> points within the lookback-delta so that the timeseries does not go stale.
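>
> For instance (the job name and target below are illustrative), keeping 
> the scrape interval well inside the default 5-minute lookback:

```yaml
global:
  # Two consecutive samples always fall inside the 5m lookback-delta,
  # so a single missed scrape does not make the series go stale.
  scrape_interval: 2m
scrape_configs:
  - job_name: my_exporter   # illustrative
    static_configs:
      - targets: ['localhost:9100']
```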
>
> There's no good reason to scrape at one hour intervals:
> * Prometheus is extremely efficient with its storage compression, 
> especially when adjacent data points are equal, so scraping the same value 
> every 2 minutes is going to use hardly any more storage than scraping it 
> every hour.
> * If you're worried about load on the exporter because responding to a 
> scrape is slow or expensive, then you should run the exporter every hour 
> from a local cronjob, and write its output to a persistent location (e.g. 
> to PushGateway or statsd_exporter, or simply write it to a file which can 
> be picked up by node_exporter textfile-collector or even a vanilla HTTP 
> server).  You can then scrape this as often as you like.
>
> node_exporter's textfile collector exposes an extra metric with the 
> modification timestamp of each file, so you can alert if a file isn't 
> being updated.
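>
> An alert along those lines might look like this 
> (node_textfile_mtime_seconds is the metric the textfile collector 
> exposes; the 2-hour threshold is illustrative):

```yaml
groups:
  - name: textfile_freshness
    rules:
      - alert: TextfileStale
        # Fire if a textfile-collector file hasn't been rewritten
        # in over 2 hours, i.e. the hourly cronjob has stopped running.
        expr: time() - node_textfile_mtime_seconds > 7200
        for: 15m
        labels:
          severity: warning
```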
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/3cfa1ba2-ef9c-4e9c-be3b-1f8ae8067e7en%40googlegroups.com.
