Fantastic summary. This would actually make a really nice addition to the
"guides" section of the Prometheus docs.

https://github.com/prometheus/docs/tree/main/content/docs/guides

On Tue, Nov 28, 2023 at 11:18 AM 'Brian Candler' via Prometheus Users <
[email protected]> wrote:

> On Tuesday, 28 November 2023 at 04:15:41 UTC Chris Siebenmann wrote:
>
> The Blackbox exporter is a bit tricky to understand in relation to up{},
> because unlike many exporters you create multiple scrape targets against
> (or through) the same exporter. This generally means you want to ignore
> the up{} metric for any particular blackbox probe and instead scrape
> Blackbox's metric endpoint and pay attention to its up{} (for alerts,
> for example).
>
>
> I think that's worded in a misleading way.
>
> Blackbox exporter does have a /metrics endpoint, but this is only for
> metrics internal to the operation of blackbox_exporter itself (e.g. memory
> stats, software version). You don't need to scrape this, but it gives you a
> little bit of extra info about how your exporter is performing.
>
> Blackbox exporter's main interface is the /probe endpoint, where you tell
> it to run individual tests: /probe?target=xxx&module=yyy
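>
> As a rough sketch of how that URL is usually produced from the Prometheus
> side, a scrape job can rewrite each listed target into the ?target=
> parameter and point the scrape at the exporter. The job name, the http_2xx
> module, the example.com target and the 127.0.0.1:9115 exporter address are
> all assumptions for illustration:
>
> scrape_configs:
> - job_name: blackbox             # assumed job name
>   metrics_path: /probe
>   params:
>     module: [http_2xx]           # assumed module defined in blackbox.yml
>   static_configs:
>   - targets:
>     - https://example.com        # placeholder probe target
>   relabel_configs:
>   # pass the listed target to the exporter as ?target=...
>   - source_labels: [__address__]
>     target_label: __param_target
>   # keep the probed target as the instance label
>   - source_labels: [__param_target]
>     target_label: instance
>   # copy the module name into a label so rules can refer to it
>   - source_labels: [__param_module]
>     target_label: module
>   # send the actual scrape to the exporter itself
>   - target_label: __address__
>     replacement: 127.0.0.1:9115  # assumed exporter address
>
> The module relabel step is optional; it only matters if rules or
> dashboards refer to a "module" label.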
>
> The 'up' metric is generated by Prometheus itself, and only tells you that
> it was successfully able to communicate with the exporter and get some
> results (without a 4xx / 5xx error for example).  So it's correct to say
> that you're not interested in the 'up' metric for scrapes to /probe, since
> it will always be 1 unless blackbox_exporter itself is badly broken, and
> you're interested in probe_success instead.
>
> This is pretty easy to arrange in alerting rules. Here's a starting point:
>
> groups:
> - name: UpDown
>   rules:
>   - alert: UpDown
>     expr: up == 0
>     for: 3m
>     keep_firing_for: 3m
>     labels:
>       severity: critical
>     annotations:
>       summary: 'Scrape failed: host is down or scrape endpoint down/unreachable'
> - name: BlackboxRules
>   rules:
>   - alert: ProbeFail
>     expr: probe_success == 0
>     for: 3m
>     keep_firing_for: 3m
>     labels:
>       severity: critical
>     annotations:
>       description: |
>         {{ $labels.instance }} ({{ $labels.module }}) probe is failing
>       summary: Probed service is down
>
> For Grafana I'd probably just make two dashboards, but if you really want
> a grand summary of all "problems" then you can simply use a PromQL
> expression like this:
>
>     up == 0 or probe_success == 0
>
> The "or" operator
> <https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>
> in PromQL is not a boolean: it's more like a set union operator.  It will
> give you all the elements of the "up" vector where the value is 0, along
> with all elements of the "probe_success" vector where the value is 0
> (except for any probe_success == 0 element whose labels, ignoring the
> metric name, are *exactly* the same as those of an up == 0 element, but
> that is unlikely anyway, since a failed scrape returns no probe_success
> sample).
>
> The consumer of this query is going to see a mixture of up{...} and
> probe_success{...} metrics, all with value 0.
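>
> For illustration only, with made-up job and instance labels, the result
> might look something like this:
>
>     up{instance="host1:9100", job="node"}                          0
>     probe_success{instance="https://example.com", job="blackbox"}  0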
>
>  there are other multi-target
> indirect exporters like Blackbox. I believe that the SNMP exporter is
> another one where you often have one exporter separately scraping a lot
> of targets, and each target will have its own up{} metric that you
> probably want to ignore.)
>
>
> The first part of that is correct: SNMP exporter uses
> /snmp?target=xxx&module=yyy&auth=zzz.
>
> But the second part is wrong: if SNMP exporter fails to talk to the target
> then it returns an empty scrape with a 4xx/5xx error code, which Prometheus
> turns into up==0.  So you definitely *do* want to alert on up==0 in this
> case, as that's how you detect a device which is failing to respond to SNMP.
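>
> A minimal sketch of a rule for that case, assuming the SNMP scrape job is
> named "snmp" (the job name, rule names and timings are placeholders to
> adjust to your own config):
>
> groups:
> - name: SnmpRules
>   rules:
>   - alert: SnmpScrapeFail
>     expr: up{job="snmp"} == 0    # assumed job name
>     for: 3m
>     labels:
>       severity: critical
>     annotations:
>       summary: 'SNMP scrape failed: device not responding or snmp_exporter unreachable'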
>
>
>
>
> In our environment, it's useful for us to have a granular view of what
> has failed. That a device has stopped pinging is a different issue than
> its node_exporter not being up, so our dashboards (and alerts) reflect
> that.
>
>
> I agree with that. Different metrics inherently have different meanings,
> and although 'up' and 'probe_success' have similar 0/1 semantics, there's
> other information you can get from blackbox_exporter when probe_success==0
> which can tell you more about the nature of the problem (e.g. failure to
> connect, failure to resolve DNS name, TLS certificate validation failure,
> etc.).
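>
> As a hedged example of digging further, these queries use metrics that
> blackbox_exporter's HTTP prober normally exposes alongside probe_success.
> Whether each one is present depends on the module and on how far the probe
> got, and the instance/job label names assume a conventional scrape config:
>
>     # HTTP status code recorded for probes that are currently failing
>     probe_http_status_code and on(instance, job) probe_success == 0
>
>     # TLS certificates that expire within the next 14 days
>     probe_ssl_earliest_cert_expiry - time() < 14 * 86400
>
> Appending &debug=true to a /probe URL also returns the log output of that
> probe, which is often the quickest way to see why it failed.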
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/adf18a14-269f-41a3-b60f-d8c7a49858ean%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-users/adf18a14-269f-41a3-b60f-d8c7a49858ean%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
