Hey Chris.
On Thursday, April 4, 2024 at 8:41:02 PM UTC+2 Chris Siebenmann wrote:
> - The evaluation interval is sufficiently less than the scrape
> interval, so that it's guaranteed that none of the `up`-samples are
> being missed.
I assume you were referring to the above specific point?
Maybe there is a misunderstanding:
With the above I merely meant that my solution requires the alert rule
evaluation interval to be small enough that, when the rule looks at
resets(up[20s] offset 60s) (which is the window from -70s to -50s plus an
additional shift by 10s, so effectively -80s to -60s), the evaluations
happen often enough that no sample can "jump over" that time window.
I.e. if the scrape interval were 10s but the evaluation interval 20s,
some samples would surely be missed.
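Concretely, the setup I have in mind looks roughly like this (group name,
alert name and the exact numbers are just illustrative placeholders):

```yaml
# Hypothetical rule group; names and intervals are illustrative only.
groups:
  - name: target-down
    # The evaluation interval must not exceed the scrape interval
    # (here both 10s), so no up-sample can fall between two
    # evaluations of the 20s window.
    interval: 10s
    rules:
      - alert: TargetWentDown
        # Window from -80s to -60s: the intended -70s..-50s window,
        # shifted 10s extra to give the scrape time to finish.
        expr: resets(up[20s] offset 60s) > 0
```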
> I don't believe this assumption about up{} is correct. My understanding
> is that up{} is not merely an indication that Prometheus has connected
> to the target exporter, but an indication that it has successfully
> scraped said exporter. Prometheus can only know this after all samples
> from the scrape target have been received and ingested and there are no
> unexpected errors, which means that just like other metrics from the
> scrape, up{} can only be visible after the scrape has finished (and
> Prometheus knows whether it succeeded or not).
Yes, I'd have assumed so as well. Therefore I generally shifted both alerts
by 10s, hoping that 10s is enough for all that.
> How long scrapes take is variable and can be up to almost their timeout
> interval. You may wish to check 'scrape_duration_seconds'. Our metrics
> suggest that this can go right up to the timeout (possibly in the case
> of failed scrapes).
Interesting.
I see the same (entries that go right up to, and even slightly above, the
timeout). It would be interesting to know whether these are scrapes that
still made it "just in time" (despite actually taking a bit longer than
the timeout), or whether they are only ones that timed out and were
discarded.
The name scrape_duration_seconds would kind of imply the former, but I
guess it's actually the latter.
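A quick way to look at this (using the standard scrape_duration_seconds
meta-metric; the 1d range is just an example) is something like:

```promql
# Highest observed scrape duration per target over the last day;
# compare this against the scrape_timeout configured for each job.
max_over_time(scrape_duration_seconds[1d])
```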
So what would you think that means for me and my solution now? That I
should shift all my checks even further, i.e. by at least the
scrape_timeout plus some extra time for the data to get into the TSDB?
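If so, the shift would become the original 50s offset plus the full
scrape_timeout plus an ingestion margin; with a 10s scrape_timeout and,
say, 5s of slack (numbers purely illustrative), the expression would
turn into:

```promql
# 20s window shifted by 50s (original offset) + 10s scrape_timeout
# + 5s ingestion slack = 65s, i.e. the window -85s to -65s.
resets(up[20s] offset 65s) > 0
```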
Thanks,
Chris.
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/f6603b09-d44b-412d-831a-c53234c85a82n%40googlegroups.com.