Fantastic summary. This would actually make a really nice addition to the "guides" section of the Prometheus docs.
https://github.com/prometheus/docs/tree/main/content/docs/guides

On Tue, Nov 28, 2023 at 11:18 AM 'Brian Candler' via Prometheus Users <[email protected]> wrote:

> On Tuesday, 28 November 2023 at 04:15:41 UTC Chris Siebenmann wrote:
>
> > The Blackbox exporter is a bit tricky to understand in relation to up{},
> > because unlike many exporters you create multiple scrape targets against
> > (or through) the same exporter. This generally means you want to ignore
> > the up{} metric for any particular blackbox probe and instead scrape
> > Blackbox's metric endpoint and pay attention to its up{} (for alerts,
> > for example).
>
> I think that's worded in a misleading way.
>
> Blackbox exporter does have a /metrics endpoint, but this is only for
> metrics internal to the operation of blackbox_exporter itself (e.g. memory
> stats, software version). You don't need to scrape this, but it gives you a
> little bit of extra info about how your exporter is performing.
>
> Blackbox exporter's main interface is the /probe endpoint, where you tell
> it to run individual tests: /probe?target=xxx&module=yyy
>
> The 'up' metric is generated by Prometheus itself, and only tells you that
> it was successfully able to communicate with the exporter and get some
> results (without a 4xx / 5xx error for example). So it's correct to say
> that you're not interested in the 'up' metric for scrapes to /probe, since
> it will always be 1 unless blackbox_exporter itself is badly broken, and
> you're interested in probe_success instead.
>
> This is pretty easy to arrange in alerting rules.
> Here's a starting point:
>
> groups:
>   - name: UpDown
>     rules:
>       - alert: UpDown
>         expr: up == 0
>         for: 3m
>         keep_firing_for: 3m
>         labels:
>           severity: critical
>         annotations:
>           summary: 'Scrape failed: host is down or scrape endpoint down/unreachable'
>   - name: BlackboxRules
>     rules:
>       - alert: ProbeFail
>         expr: probe_success == 0
>         for: 3m
>         keep_firing_for: 3m
>         labels:
>           severity: critical
>         annotations:
>           description: |
>             {{ $labels.instance }} ({{ $labels.module }}) probe is failing
>           summary: Probed service is down
>
> For Grafana I'd probably just make two dashboards, but if you really want
> a grand summary of all "problems" then you can simply use a PromQL
> expression like this:
>
>     up == 0 or probe_success == 0
>
> The "or" operator
> <https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>
> in PromQL is not a boolean: it's more like a set union operator. It will
> give you all the values of the "up" vector where the value is 0, along with
> all values of the "probe_success" vector where the value is 0 (except for
> values of probe_success == 0 which have *exactly* the same labels as up ==
> 0, but those are unlikely anyway)
>
> The consumer of this query is going to see a mixture of up{...} and
> probe_success{...} metrics, all with value 0.
>
> > there are other multi-target
> > indirect exporters like Blackbox. I believe that the SNMP exporter is
> > another one where you often have one exporter separately scraping a lot
> > of targets, and each target will have its own up{} metric that you
> > probably want to ignore.)
>
> The first part of that is correct: SNMP exporter uses
> /snmp?target=xxx&module=yyy&auth=zzz.
>
> But the second part is wrong: if SNMP exporter fails to talk to the target
> then it returns an empty scrape with a 4xx/5xx error code, which prometheus
> turns into up==0.
> So you definitely *do* want to alert on up==0 in this
> case, as that's how you detect a device which is failing to respond to SNMP.
>
> > In our environment, it's useful for us to have a granular view of what
> > has failed. That a device has stopped pinging is a different issue than
> > its node_exporter not being up, so our dashboards (and alerts) reflect
> > that.
>
> I agree with that. Different metrics inherently have different meanings,
> and although 'up' and 'probe_success' have similar 0/1 semantics, there's
> other information you can get from blackbox_exporter when probe_success==0
> which can tell you more about the nature of the problem (e.g. failure to
> connect, failure to resolve DNS name, TLS certificate validation failure
> etc)
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/adf18a14-269f-41a3-b60f-d8c7a49858ean%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-users/adf18a14-269f-41a3-b60f-d8c7a49858ean%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
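
For anyone reading this thread later, the /probe?target=xxx&module=yyy pattern described in the quoted message is usually wired up with relabelling on the Prometheus side. What follows is only a rough sketch and not a tested configuration: the job name, the probed URL, the http_2xx module name, and the exporter address 127.0.0.1:9115 are all assumptions to be replaced with your own values.

```yaml
scrape_configs:
  - job_name: blackbox            # hypothetical job name
    metrics_path: /probe          # scrape the probe endpoint, not /metrics
    params:
      module: [http_2xx]          # assumed module; must exist in your blackbox.yml
    static_configs:
      - targets:
          - https://example.org   # placeholder for the thing being probed
    relabel_configs:
      # Pass the listed target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed target as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Copy the module parameter into a module label
      - source_labels: [__param_module]
        target_label: module
      # Send the actual scrape to the exporter itself (assumed address)
      - target_label: __address__
        replacement: 127.0.0.1:9115
```

With something along these lines, up for this job only says whether Prometheus could reach the exporter, while probe_success says whether the probed target responded. The module relabel rule is there because the ProbeFail description in the quoted rules refers to $labels.module, which would otherwise be empty.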

