Hello,
1.) Is the 50s timeout the same in the Prometheus scrape_config and in the
snmp.yml file?
2.) Is ifHCInOctetsIntfName really the name of the interface label?
3.) The =~".*.\\/.*." matcher may show many interfaces, maybe some internal
loopback which may count traffic twice? Furthermore, it may show the
PortChannel (Po), then the VLAN (Po.xy), and the physical interfaces!?
I am not sure, but the screenshots show "stacked lines". Is it possible
that in the first screenshot the throughput of all interfaces was stacked?
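One way to check which interfaces the matcher from point 3 actually selects is to run the same regular expression over the label values. Below is a minimal sketch in Python, assuming the PromQL string ".*.\\/.*." unescapes to the regex .*.\/.*. and relying on the fact that PromQL label matchers are fully anchored. The interface names are made up for illustration and are not taken from the switch in this thread.

```python
import re

# Regex from the query after PromQL string unescaping: "\\/" becomes "\/".
PATTERN = re.compile(r".*.\/.*.")


def matches_selector(name: str) -> bool:
    """Return True if the label value would be selected by the =~ matcher."""
    # PromQL label matchers are fully anchored, so fullmatch is the equivalent.
    return PATTERN.fullmatch(name) is not None


# Hypothetical interface names (assumption: NOT the real label values).
names = [
    "GigabitEthernet1/0/1",
    "Te1/1/4",
    "Po1",
    "Po1.10",
    "Vlan10",
    "Loopback0",
]

matched = [n for n in names if matches_selector(n)]
print(matched)
```

With these sample names, only the ones containing a "/" with at least one character on each side are selected. The real label values on the device may look different, so it is worth running the check against the actual values of ifHCInOctetsIntfName.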
Nick Carlton wrote on Friday, 15 March 2024 at 23:43:19 UTC+1:
> To clarify, my scrapes for this data run every 1m and have a timeout of 50s
>
> On Friday 15 March 2024 at 22:41:52 UTC Nick Carlton wrote:
>
>> Hello Everyone,
>>
>> I have just seen something weird in my environment: interface
>> bandwidth on a gigabit switch reached about 1 Tbps on some of the
>> interfaces.
>>
>> Here is the query I'm using:
>>
>> rate(ifHCInOctets{ifHCInOctetsIntfName=~".*.\\/.*.",instance="<device-name>"}[2m]) * 8
>>
>> I've never had a problem with it before. Here is an image of the graph
>> showing the massive increase in bandwidth and then the decrease back to normal:
>>
>> [image: Screenshot 2024-03-15 222353.png]
>>
>> When I investigated further into what could have happened, I could see
>> that the 'snmp_scrape_duration_seconds' metric increased to around
>> 20s at the time. So the Cisco switch was taking 20 seconds to respond to
>> the SNMP request.
>>
>> [image: Screenshot 2024-03-15 222244.png]
>>
>> I'm a bit confused as to how this could cause the rate query to give
>> completely false data. Could the delay in data have caused Prometheus to
>> think there was more bandwidth on the interface? The switch certainly
>> cannot do the speeds the graph is claiming!
>>
>> I'm on v0.25.0 of the SNMP exporter and scrapes normally take around 2s.
>> I'm not blaming the exporter for the high response times; that's
>> probably the switch. I'm just wondering whether the high response time
>> could somehow cause the rate query to give incorrect data. The fact that
>> the graph went back to normal after the high response times makes me think
>> it wasn't the switch giving duff data.
>>
>> Has anyone seen this before, and is there any way to mitigate it? Happy to
>> provide more info if required :)
>>
>> Thanks
>> Nick
>>
>
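On the question of whether a rate over a counter can produce a value far above line speed: the sketch below is a simplified model in Python, not Prometheus's actual implementation (the real rate() also extrapolates to the window boundaries). It relies on one documented behaviour, namely that any decrease in a counter is treated as a counter reset. Whether a slow SNMP response ever delivered a lower value than the previous scrape is an assumption here; nothing in the thread confirms it. The sample values and the name simple_rate are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase over a list of (timestamp_seconds, counter_value).

    Simplified model: a drop between consecutive samples is treated as a
    counter reset, so the value after the drop is counted in full.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev  # normal monotonic growth
        else:
            increase += cur  # treated as a reset: counter assumed to restart at 0
    duration = samples[-1][0] - samples[0][0]
    return increase / duration


# Hypothetical 64-bit octet counter, scraped 60s apart.
normal = [(0, 5_000_000_000_000), (60, 5_000_600_000_000)]
# Same counter, but the second sample is slightly LOWER than the first
# (assumption: e.g. a stale or out-of-order value from the device).
glitch = [(0, 5_000_600_000_000), (60, 5_000_500_000_000)]

normal_bps = simple_rate(normal) * 8
glitch_bps = simple_rate(glitch) * 8
print(f"normal: {normal_bps / 1e6:.1f} Mbit/s")
print(f"glitch: {glitch_bps / 1e9:.1f} Gbit/s")
```

In this model, a counter that steps back by a small amount is enough to produce a rate of several hundred Gbit/s for one window, after which the graph returns to normal. Looking at the raw ifHCInOctets samples around the spike would show whether any value actually went backwards.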