Hey Chris.

On Thursday, April 4, 2024 at 8:41:02 PM UTC+2 Chris Siebenmann wrote:

> - The evaluation interval is sufficiently less than the scrape 
> interval, so that it's guaranteed that none of the `up`-samples are 
> being missed. 


I assume you were referring to the above specific point?

Maybe there is a misunderstanding:

With the above I merely meant that my solution requires the alert 
rule evaluation interval to be small enough that, when it looks at 
resets(up[20s] offset 60s) (which is the window from -70s to -50s PLUS an 
additional shift by 10s, so effectively -80s to -60s), the evaluations 
happen often enough that no sample can "jump over" that time window.

I.e. if the scrape interval was 10s, but the evaluation interval only 20s, 
it would surely miss some.
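
To make that condition concrete, here is a minimal sketch in Python. It is 
only a toy model, not how Prometheus works internally. It assumes perfectly 
regular scrapes, whole-second timestamps, samples that become visible 
instantly, and a window that is open on the left and closed on the right. 
Since resets() compares consecutive samples, it checks whether each pair of 
neighbouring samples lands inside at least one evaluated window. The function 
name and parameters are made up for illustration.

```python
def missed_transitions(scrape_interval, eval_interval, window=20, offset=60,
                       scrape_phase=0, eval_phase=0, horizon=600):
    """Return the (prev, cur) sample pairs that no evaluation window contains.

    Toy model: each evaluation at time t looks at the window
    (t - offset - window, t - offset]. All times are whole seconds.
    """
    samples = range(scrape_phase, horizon, scrape_interval)
    # Evaluate long enough that the last samples have passed through the window.
    evals = range(eval_phase, horizon + offset + window + eval_interval,
                  eval_interval)
    missed = []
    for prev, cur in zip(samples, samples[1:]):
        covered = any(t - offset - window < prev and cur <= t - offset
                      for t in evals)
        if not covered:
            missed.append((prev, cur))
    return missed


if __name__ == "__main__":
    # 10s scrapes, 10s evaluations: every neighbouring pair is seen.
    print(len(missed_transitions(10, 10)))
    # 10s scrapes, 20s evaluations: some pairs straddle two windows.
    print(len(missed_transitions(10, 20)))
```

In this model, a 20s window with 10s scrapes only sees every pair when the 
evaluation interval is 10s or less. With 20s evaluations the windows merely 
tile the timeline, so every other pair straddles a window boundary.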
 

> I don't believe this assumption about up{} is correct. My understanding 
> is that up{} is not merely an indication that Prometheus has connected 
> to the target exporter, but an indication that it has successfully 
> scraped said exporter. Prometheus can only know this after all samples 
> from the scrape target have been received and ingested and there are no 
> unexpected errors, which means that just like other metrics from the 
> scrape, up{} can only be visible after the scrape has finished (and 
> Prometheus knows whether it succeeded or not). 


Yes, I'd have assumed so as well. Therefore I generally shifted both alerts 
by 10s, hoping that 10s is enough for all that.

 

> How long scrapes take is variable and can be up to almost their timeout 
> interval. You may wish to check 'scrape_duration_seconds'. Our metrics 
> suggest that this can go right up to the timeout (possibly in the case 
> of failed scrapes). 


Interesting. 

I see the same (I mean entries that go up to and even a bit above the 
timeout). It would be interesting to know whether these are scrapes that 
still made it "just in time" (despite actually taking a bit longer than the 
timeout), or whether these are only ones that timed out and were 
discarded.
The name scrape_duration_seconds would kind of imply that it's the 
former, but I guess it's actually the latter.

So what do you think that means for me and my solution now? That I should 
shift all my checks even further, i.e. by at least the scrape_timeout plus 
some extra time for the data to get into the TSDB?


Thanks,
Chris.

