You can also execute the query via the Prometheus-compatible API:
https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries
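For instance, here is a quick sketch of pulling raw samples over that API with Python's standard library. The localhost base URL and the metric selector are placeholders (assuming an unauthenticated endpoint); a managed service will need its own base URL and auth headers:

```python
# Sketch: fetch raw samples (no rate()) via the Prometheus HTTP API.
# PROM_URL and the selector below are placeholders, not your real setup.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumption: plain local instance

# A range-vector selector returns the raw samples, which is what you
# want when debugging suspected counter resets.
query = 'ifHCInOctets{instance="<device-name>"}[2m]'

url = PROM_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": query})

def fetch_raw_samples(url: str) -> dict:
    """Run an instant query and return the decoded JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

# Each entry in ["data"]["result"] looks like:
#   {"metric": {...}, "values": [[<timestamp>, "<value>"], ...]}
# print(fetch_raw_samples(url)["data"]["result"])
```

A drop between consecutive raw values for the same series is what `rate()` interprets as a counter reset.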
The same can be done via the Grafana datasource API endpoint.

> managed endpoint and then the other end supposedly deduplicates the
> metrics

This is 99% likely the problem. The remote storage is deduplicating, but it's flip-flopping between your two Prometheus instances' data.

Each Prometheus instance consistently pseudo-randomizes the exact millisecond of its scrape times to avoid load spikes on the targets. Since each Prometheus instance is scraping at slightly different times, if the remote TSDB ends up keeping a sample that is slightly older, a "newer" sample may actually contain slightly lower values from the device. This tricks Prometheus into thinking there was a counter reset, so it thinks the full counter's value was transferred between the two scrapes. For example, if one instance writes ifHCInOctets at 1,000,000,000 and the other instance's slightly earlier sample of 999,990,000 is stored after it, rate() sees the counter go down, assumes a reset to zero, and counts roughly the whole 999,990,000 octets as new traffic in that window.

There are a few options:

* Use only one Prometheus server for SNMP targets, to avoid the deduplication happening on your remote-write storage.
* Set up a caching HTTP reverse proxy between your Prometheus instances and the snmp_exporter, with a cache TTL that matches your scrape interval.
* Wait for / contribute to SNMP walk caching in the snmp_exporter.

I would love to add a full SNMP walk cache to the snmp_exporter. I would also like to support memcached/redis for clustering persistence. But since my $dayjob has no SNMP, it's hard for me to prioritize work on it.

On Sat, Mar 16, 2024 at 9:39 AM Nick Carlton <[email protected]> wrote:

> Thanks both,
>
> I must be honest, I never managed to get the generator to work with MIB
> dependencies, so I have written my snmp.yml manually with other lookups
> etc. and have never seen these values documented.
>
> Is there a best-practice guide for their values when you are having
> certain issues, or for using them to speed up SNMP scrapes? I can't seem
> to find any solid documentation.
>
> Ben - I'll try to get that data, but this is a managed Prometheus, so I
> don't have access to the main Prometheus UI, just a built-in version,
> though it should give me the same data.
> It's possible there is duplicate data here, because there are two
> Prometheus boxes polling these switches for the same metrics and sending
> duplicate data over remote write to the managed endpoint, and then the
> other end supposedly deduplicates the metrics. Is there any way to defend
> against this on the side I can control?
>
> Thanks
> Nick
>
> On Sat, 16 Mar 2024 at 08:10, Alexander Wilke <[email protected]> wrote:
>
>> https://github.com/prometheus/snmp_exporter/tree/main/generator
>>
>> Alexander Wilke wrote on Saturday, March 16, 2024 at 09:08:44 UTC+1:
>>
>>> Check the file format example.
>>>
>>> Timeout, retries, max-repetitions.
>>>
>>> I use max-repetitions 50 or 100 with Cisco, retries 0, and a timeout
>>> 1s or 500ms below the Prometheus timeout.
>>>
>>> Ben Kochie wrote on Saturday, March 16, 2024 at 06:31:17 UTC+1:
>>>
>>>> This is very likely a problem with counter resets or some other kind
>>>> of duplicate data.
>>>>
>>>> The best way to figure this out is to perform the query, but without
>>>> the `rate()` function.
>>>>
>>>> This can be done via the Prometheus UI (harder to do in Grafana) in
>>>> the "Table" view.
>>>>
>>>> Here is an example demo query
>>>> <https://prometheus.demo.do.prometheus.io/graph?g0.expr=process_cpu_seconds_total%7Bjob%3D%22prometheus%22%7D%5B2m%5D&g0.tab=1&g0.display_mode=lines&g0.show_exemplars=0&g0.range_input=1h>
>>>>
>>>> The result is a list of the raw samples that are needed to debug.
>>>>
>>>> On Fri, Mar 15, 2024 at 11:41 PM Nick Carlton <[email protected]> wrote:
>>>>
>>>>> Hello Everyone,
>>>>>
>>>>> I have just seen something weird in my environment, where interface
>>>>> bandwidth on a gigabit switch reached about 1 Tbps on some of the
>>>>> interfaces.
>>>>>
>>>>> Here is the query I'm using:
>>>>>
>>>>> rate(ifHCInOctets{ifHCInOctetsIntfName=~".*.\\/.*.",instance="<device-name>"}[2m]) * 8
>>>>>
>>>>> I've never had a problem with it before.
>>>>> Here is an image of the graph showing the massive increase in
>>>>> bandwidth and then the decrease back to normal:
>>>>>
>>>>> [image: Screenshot 2024-03-15 222353.png]
>>>>>
>>>>> When I did some more investigation into what could have happened, I
>>>>> could see that the 'snmp_scrape_duration_seconds' metric increases to
>>>>> around 20s at the time. So the Cisco switch is taking 20 seconds to
>>>>> respond to the SNMP request.
>>>>>
>>>>> [image: Screenshot 2024-03-15 222244.png]
>>>>>
>>>>> I'm a bit confused as to how this could cause the rate query to give
>>>>> completely false data. Could the delay in data have caused Prometheus
>>>>> to think there was more bandwidth on the interface? The switch
>>>>> certainly cannot do the speeds the graph is claiming!
>>>>>
>>>>> I'm on v0.25.0 of the SNMP exporter, and it normally sits around 2s
>>>>> for the scrapes. I'm not blaming the exporter for the high response
>>>>> times; that's probably the switch. I'm just wondering if the high
>>>>> response time could in some way cause the rate query to give incorrect
>>>>> data. The fact that the graph went back to normal after the high
>>>>> response times makes me think it wasn't the switch giving duff data.
>>>>>
>>>>> Has anyone seen this before, and is there any way to mitigate it?
>>>>> Happy to provide more info if required :)
>>>>>
>>>>> Thanks
>>>>> Nick
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-users/6fd3dca6-2013-47ad-af8f-3344e79954a7n%40googlegroups.com

