On 06.06.21 09:56, Christian Galsterer wrote:
> There are metrics for the actual scrape duration, but currently there are
> no metrics for the scrape timeouts. Adding metrics for the scrape timeout
> would make it possible to monitor and alert on scrape timeouts without
> hard-coding the timeouts in the PromQL queries; the new metric could be
> used instead.
Sounds like a good idea at first glance, but note that this would be yet another metric that gets automatically added to every single target. I think we have to be careful when doing so.

Your proposal mirrors a part of the configuration into metrics. That is sometimes a neat thing to do, but it has to be enjoyed responsibly. In this case, you want to specifically alert on scrape timeouts (or, I guess, on approaching them; see the first sketch below). The same argument could be made for alerting on exceeding (or approaching) the sample limit, so we would need a new scrape metric for the `sample_limit` configuration setting, too. The same is true for all the other limits: `label_limit`, `label_name_length_limit`, `label_value_length_limit`, `target_limit` (the settings in question are shown in the config sketch below). So we would have to add _six_ new metrics. Also, I have had a bunch of situations where I would have liked to know the intended scrape interval of a series (rather than guessing it from the spacing of the samples I could see in the series). So that's yet another metric for the configured scrape interval. Things are getting out of control here...

The question is, of course, why you would like to alert on scrape timeouts specifically. There are many possible reasons why a scrape fails. Generally, I would recommend just alerting on `up` being zero too often (a minimal rule is sketched below). If that alert fires, you can then check out the Prometheus server in question and investigate _why_ the scrapes are failing.

Interestingly, we have a metric `prometheus_rule_group_interval_seconds` for the configured evaluation interval of a rule group. Note, however, that this is not a synthetic metric injected alongside the evaluation result of the rule group, but only exposed by the `/metrics` endpoint of Prometheus itself. That's only one metric per rule group, and it's exposed for meta-monitoring, which could happen on a separate server, so it doesn't "pollute" the normal metrics.

In summary, I'm pretty sure we shouldn't add half a dozen synthetic metrics for each target to mirror its configuration into metrics. But perhaps we could add more metrics for meta-monitoring. Have a look at the already existing metrics beginning with `prometheus_target_...`. There is, for example, `prometheus_target_scrapes_exceeded_sample_limit_total`, but note that this is just one metric for the whole server. It's mostly meant to give you a specific alert if _any_ target runs into the sample limit. Perhaps the same could be done for timeouts as `prometheus_target_scrapes_exceeded_scrape_timeout` (see the last sketch below).
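For concreteness, here is the kind of alerting rule the proposal is about, first as it has to be written today and then with the proposed synthetic metric. The metric name `scrape_timeout_seconds`, the alert names, and the 80% threshold are all made up for this sketch:

    groups:
      - name: scrape-timeout-sketch
        rules:
          - alert: ScrapeApproachingTimeoutHardCoded
            # Status quo: the 10s timeout is duplicated from the scrape
            # config and silently goes stale whenever that config changes.
            expr: scrape_duration_seconds > 0.8 * 10
          - alert: ScrapeApproachingTimeoutProposed
            # With a synthetic scrape_timeout_seconds metric, the
            # threshold would follow the configuration automatically.
            expr: scrape_duration_seconds > 0.8 * scrape_timeout_seconds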
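For reference, these are the per-scrape-config settings that would each need their own synthetic metric under this line of reasoning. A minimal scrape config sketch with made-up values:

    scrape_configs:
      - job_name: example
        scrape_interval: 30s
        scrape_timeout: 10s
        sample_limit: 5000
        label_limit: 64
        label_name_length_limit: 128
        label_value_length_limit: 512
        target_limit: 200
        static_configs:
          - targets: ['host1:9100']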
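The recommended `up` alert could look like the following sketch. Alert name, duration, and labels are placeholders to be adapted to your environment:

    groups:
      - name: availability
        rules:
          - alert: TargetDown
            # "up is zero too often": all scrapes of this target have
            # failed for five straight minutes.
            expr: up == 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: 'Scrapes of {{ $labels.instance }} (job {{ $labels.job }}) are failing.'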
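As an aside, `prometheus_rule_group_interval_seconds` enables meta-monitoring expressions like the following sketch, which compares the last evaluation duration of each rule group against its configured interval (both metrics come from the Prometheus server's own `/metrics` endpoint):

    groups:
      - name: rule-group-meta
        rules:
          - alert: RuleGroupEvaluationTooSlow
            # A rule group whose evaluation takes longer than its
            # configured interval cannot keep up.
            expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds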
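Finally, a meta-monitoring sketch along those lines. The first rule uses the existing per-server counter; the second shows what the analogous rule for the proposed (not yet existing) timeout counter could look like:

    groups:
      - name: meta-monitoring
        rules:
          - alert: ScrapeSampleLimitHit
            # Fires if any target on this server ran into the sample
            # limit within the last five minutes.
            expr: rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m]) > 0
          - alert: ScrapeTimeoutHit
            # Hypothetical: this counter does not exist today; the name
            # is taken from the proposal above.
            expr: rate(prometheus_target_scrapes_exceeded_scrape_timeout[5m]) > 0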
--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] [email protected]
