On Sat, Mar 16, 2024 at 12:16 PM Nick Carlton <[email protected]> wrote:
> Thanks Ben, that makes sense. I suppose that was exacerbated by the longer
> scrape times at the time.

That shouldn't make much of a difference, unless the remote storage does something funny with the data. Prometheus tags timestamps at the start of the scrape for consistency. The scrape duration does not affect the timestamp that the data represents. The idea is that it is up to the target to lock any mutexes to provide a consistent data snapshot; the time it then takes to ship that data over the wire is unimportant.

> With the second option of using a caching HTTP proxy: I’m running each
> Prometheus and the snmp exporter on the same box, so a separate instance per
> Prometheus instance. While it’s a great idea, it would only cache the
> local snmp exporter’s results. I’ve tried to make this setup as resilient as
> possible without something like k8s. When snmp walk caching comes in, for
> the same reason above I think I’ll have the same issue?

My idea is to use an external cache like memcached or redis, something that can share clustered caches between multiple instances of the exporter for reliability.

I guess your best option there is to set up a separate node with the caching proxy and the exporter, then point both Prometheus instances at that one node.

One other idea: there are caching proxies that can use redis. I have no idea if this is still viable / high quality, but you can try it:
https://github.com/desbouis/nginx-redis-proxy

> I think what I might have to do is pull the interface bandwidth counters
> out of the main snmp module and only scrape them from one of the instances.
> That way there is no risk of duplicate data hitting the remote write, and I
> can also move anything else that I query using “rate”.

That depends a bit on how the remote write service handles deduplication. IIRC, services based on Cortex and Mimir deduplicate the whole connection, so they assume that two connected instances are "identical".
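For the caching-reverse-proxy idea, a minimal nginx sketch might look like the following. This is an assumption of how it could be wired up, not a tested config; the listen port, cache path, zone name, and the 10s TTL (matching a hypothetical 10s scrape interval) are all placeholders to adjust:

```nginx
# Hypothetical: cache snmp_exporter responses for one scrape interval, so
# two Prometheus instances scraping the same target trigger only one walk.
proxy_cache_path /var/cache/nginx/snmp keys_zone=snmp_cache:10m max_size=100m;

server {
    listen 9117;  # both Prometheus instances scrape this instead of 9116

    location /snmp {
        proxy_pass http://127.0.0.1:9116;   # local snmp_exporter
        proxy_cache snmp_cache;
        # Key on path + query string so each target/module pair caches separately.
        proxy_cache_key "$uri$is_args$args";
        proxy_cache_valid 200 10s;          # TTL matching the scrape interval
        proxy_cache_lock on;                # collapse concurrent scrapes into one walk
    }
}
```

With `proxy_cache_lock on`, if both instances scrape at nearly the same moment, only one request reaches the exporter and the second is served from the fresh cache entry.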
Again, this is very much up to your remote storage implementation to figure out.

> Though I would love to contribute, I’m not fluent enough in Go to offer any
> meaningful assistance :).
>
> Thanks
> Nick
>
> On Sat, 16 Mar 2024 at 09:38, Ben Kochie <[email protected]> wrote:
>
>> You can also execute the query via the Prometheus-compatible API.
>>
>> https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries
>>
>> The same can be done via the Grafana datasource API endpoint.
>>
>> > managed endpoint and then the other end supposedly deduplicates the
>> > metrics
>>
>> This is 99% likely the problem. The remote storage is deduplicating, but
>> it's flip-flopping between your two Prometheus instances' data. Each
>> Prometheus consistently pseudo-randomizes the exact millisecond of the
>> scrape time to avoid load spikes on the targets. Since each Prometheus
>> instance is scraping at slightly different times, if the remote TSDB
>> inserts one that is slightly older, a "newer" sample may actually carry
>> slightly lower values from the devices. This tricks Prometheus into
>> thinking there was a counter reset, so it thinks there was the full
>> counter's value of data between the two scrapes.
>>
>> There are a few options:
>> * Use only one Prometheus server for SNMP targets to avoid the
>>   deduplication happening on your remote write storage.
>> * Set up a caching HTTP reverse proxy between your Prometheus instances
>>   and the snmp_exporter with a cache TTL that matches your scrape interval.
>> * Wait for / contribute to SNMP walk caching in the snmp_exporter.
>>
>> I would love to add a full SNMP walk cache to the snmp_exporter. I would
>> like to support memcached/redis as well for clustering persistence. But
>> since my $dayjob has no SNMP, it's hard for me to prioritize work on it.
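To see why interleaved samples from two scrapers can look like a counter reset, here is a small Python sketch. It is not Prometheus's actual code, just the same idea: one plausible failure mode is a backend that stores a deduplicated sample under a later position even though it was scraped slightly earlier, so a "newer" sample carries a lower counter value, and reset handling then adds the full counter value back in.

```python
# Simulated counter-reset artifact from deduplicating two scrapers.
# A counter increases at a steady 1000 units/s. Two Prometheus instances
# scrape it every 15s with a slight offset; the dedup flip-flops between
# them, and one flip lands an earlier-scraped (lower) value after a later one.

RATE_PER_S = 1_000.0

def counter_at(t):
    """True counter value at time t (monotonic, no real resets)."""
    return RATE_PER_S * t

# (timestamp, value) pairs as the remote store ends up with them.
# The 4th sample came from the other instance: it arrived later, but its
# walk actually ran at t=44.7, so its value is below the previous sample.
samples = [
    (15.0, counter_at(15.0)),
    (30.0, counter_at(30.0)),
    (45.0, counter_at(45.0)),
    (45.5, counter_at(44.7)),   # "newer" sample, slightly older data
    (60.0, counter_at(60.0)),
]

def increase(samples):
    """Prometheus-style increase: a drop in a counter is read as a reset
    (counter restarted at 0), so the new value is counted in full."""
    total = 0.0
    for (_, v0), (_, v1) in zip(samples, samples[1:]):
        total += (v1 - v0) if v1 >= v0 else v1
    return total

true_increase = counter_at(60.0) - counter_at(15.0)
seen_increase = increase(samples)
print(true_increase, seen_increase)  # 45000.0 vs 90000.0 -- double the traffic
```

A 300-unit dip doubles the apparent increase over the window; with 64-bit interface octet counters that have been running for months, the same dip reports terabits of phantom traffic in one step.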
>>
>> On Sat, Mar 16, 2024 at 9:39 AM Nick Carlton <[email protected]> wrote:
>>
>>> Thanks both,
>>>
>>> I must be honest, I never managed to get the generator to work with MIB
>>> dependencies, so I have written my snmp.yml manually with other lookups
>>> etc. and have never seen these values documented.
>>>
>>> Is there a best-practice guide for their values when you are having
>>> certain issues, or for using them to speed up SNMP scrapes? I can’t seem
>>> to find any solid documentation.
>>>
>>> Ben - I’ll try and get that data, but this is a managed Prometheus, so I
>>> don’t have access to the main Prometheus UI, just a built-in version. It
>>> should give me the same data, though. It’s possible there is duplicate
>>> data here, because there are two Prometheus boxes polling these switches
>>> for the same metrics and sending duplicate data over remote write to the
>>> managed endpoint, and then the other end supposedly deduplicates the
>>> metrics. Is there any way to defend against this on the side I can control?
>>>
>>> Thanks
>>> Nick
>>>
>>> On Sat, 16 Mar 2024 at 08:10, Alexander Wilke <[email protected]> wrote:
>>>
>>>> https://github.com/prometheus/snmp_exporter/tree/main/generator
>>>>
>>>> On Saturday, March 16, 2024 at 09:08:44 UTC+1, Alexander Wilke wrote:
>>>>
>>>>> Check the file format example.
>>>>>
>>>>> Timeout, retries, max_repetitions.
>>>>>
>>>>> I use max_repetitions 50 or 100 with Cisco, retries 0, and a timeout
>>>>> of 1s or 500ms below the Prometheus timeout.
>>>>>
>>>>> On Saturday, March 16, 2024 at 06:31:17 UTC+1, Ben Kochie wrote:
>>>>>
>>>>>> This is very likely a problem with counter resets or some other kind
>>>>>> of duplicate data.
>>>>>>
>>>>>> The best way to figure this out is to perform the query, but without
>>>>>> the `rate()` function.
>>>>>>
>>>>>> This can be done via the Prometheus UI (harder to do in Grafana) in
>>>>>> the "Table" view.
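For reference, the tuning knobs Alexander mentions live in the generator config (and flow into the generated snmp.yml). A sketch along the lines of his numbers, assuming the v0.25-era generator schema; the auth/module names and OIDs here are illustrative, so check the generator's own format docs for your version:

```yaml
# generator.yml fragment (illustrative, not a drop-in config)
auths:
  public_v2:
    version: 2
    community: public
modules:
  if_mib:
    walk:
      - ifHCInOctets
      - ifHCOutOctets
    max_repetitions: 50   # 50-100 reported to work well on Cisco gear
    retries: 0            # fail fast rather than pile retried walks onto a slow switch
    timeout: 9s           # keep ~0.5-1s below the Prometheus scrape_timeout (10s assumed here)
```

Higher `max_repetitions` means fewer GETBULK round trips per walk, which is usually the biggest lever on scrape duration.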
>>>>>>
>>>>>> Here is an example demo query:
>>>>>> https://prometheus.demo.do.prometheus.io/graph?g0.expr=process_cpu_seconds_total%7Bjob%3D%22prometheus%22%7D%5B2m%5D&g0.tab=1&g0.display_mode=lines&g0.show_exemplars=0&g0.range_input=1h
>>>>>>
>>>>>> The result is the list of raw samples that are needed to debug.
>>>>>>
>>>>>> On Fri, Mar 15, 2024 at 11:41 PM Nick Carlton <[email protected]> wrote:
>>>>>>
>>>>>>> Hello Everyone,
>>>>>>>
>>>>>>> I have just seen something weird in my environment, where interface
>>>>>>> bandwidth on a gigabit switch reached about 1 Tbps on some of the
>>>>>>> interfaces.....
>>>>>>>
>>>>>>> Here is the query I'm using:
>>>>>>>
>>>>>>> rate(ifHCInOctets{ifHCInOctetsIntfName=~".*.\\/.*.",instance="<device-name>"}[2m]) * 8
>>>>>>>
>>>>>>> I've never had a problem with it before. Here is an image of the
>>>>>>> graph showing the massive increase in bandwidth and then the decrease
>>>>>>> back to normal:
>>>>>>>
>>>>>>> [image: Screenshot 2024-03-15 222353.png]
>>>>>>>
>>>>>>> When I did some more investigation into what could have happened,
>>>>>>> I could see that the 'snmp_scrape_duration_seconds' metric increased
>>>>>>> to around 20s at the time. So the Cisco switch is taking 20 seconds
>>>>>>> to respond to the SNMP request.
>>>>>>>
>>>>>>> [image: Screenshot 2024-03-15 222244.png]
>>>>>>>
>>>>>>> I'm a bit confused as to how this could cause the rate query to give
>>>>>>> completely false data. Could the delay in data have caused Prometheus
>>>>>>> to think there was more bandwidth on the interface? The switch
>>>>>>> certainly cannot do the speeds the graph is claiming!
>>>>>>>
>>>>>>> I'm on v0.25.0 of the SNMP exporter, and it normally sits around 2s
>>>>>>> for the scrapes. I'm not blaming the exporter for the high response
>>>>>>> times; that's probably the switch. I'm just wondering if in some way
>>>>>>> the high response time could cause the rate query to give incorrect data.
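If you only have API access rather than the full UI, you can pull the same raw samples yourself via the instant-query endpoint. A small Python sketch; the base URL is a placeholder, and a managed service will likely also need an auth header:

```python
# Fetch raw samples for a counter over a 2m window via the instant query API.
# Passing a range selector like `metric[2m]` to /api/v1/query returns the
# raw samples, which is what you need to spot a counter dip / apparent reset.
import json
import urllib.parse
import urllib.request

BASE_URL = "https://prometheus.example.com"  # placeholder for your endpoint

def build_query_url(base, expr):
    """Instant-query URL; the PromQL expression is percent-encoded."""
    return base + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})

url = build_query_url(BASE_URL, 'ifHCInOctets{instance="<device-name>"}[2m]')

def fetch_raw_samples(url):
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # Result type "matrix": one entry per series, each with [ts, value] pairs.
    return data["data"]["result"]

# Uncomment with a real endpoint:
# for series in fetch_raw_samples(url):
#     print(series["metric"], series["values"])
```

Any pair of adjacent values where the counter goes down (without the device actually rebooting) is the smoking gun for the deduplication theory.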
>>>>>>> The fact that the graph went back to normal after the high response
>>>>>>> times makes me think it wasn't the switch giving duff data.
>>>>>>>
>>>>>>> Has anyone seen this before, and is there any way to mitigate it?
>>>>>>> Happy to provide more info if required :)
>>>>>>>
>>>>>>> Thanks
>>>>>>> Nick
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "Prometheus Users" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/prometheus-users/6fd3dca6-2013-47ad-af8f-3344e79954a7n%40googlegroups.com

