It's difficult to make sense of what you're saying. Without seeing logs from both the agent and the server while this problem was occurring (e.g. `journalctl -eu prometheus`), it's hard to know what was really happening. You also need to say which exact versions of prometheus and the agent were running.
The fundamental issue here is: why should restarting the *agent* cause the prometheus *server* to stop returning 500 errors?

> So my question is why 5xx from the prometheus server is considered Recoverable?

It is by definition of the HTTP protocol: https://datatracker.ietf.org/doc/html/rfc2616#section-10.5

Actually it depends on exactly which 5xx status code you're talking about, but the common 500 and 503 errors are generally transient, meaning there was a problem at the server and the request may succeed if tried again later. If the prometheus server wanted to tell the client that the request was invalid and could never possibly succeed, then it would return a 4xx error instead.

> And I believe there should be a way to exit the loop, for example a maximum times to retry.

You are saying that you would prefer the agent to throw away data, rather than hold onto the data and try again later when it may succeed. In this situation, retrying is normally the correct thing to do.

You may have come across a bug where a *particular* piece of data being sent by the agent was causing a *particular* version of prometheus to fail with a 5xx internal error every time. The logs should make it clear whether this was happening.

On Friday 17 May 2024 at 10:02:49 UTC+1 koly li wrote:
> Hello all,
>
> Recently we found that our samples are all lost. After some investigation,
> we found:
> 1, we are using prometheus agent to send all data to prometheus server by
> remote write
> 2, the agent sample sending code is in storage/remote/queue_manager.go,
> the function is sendWriteRequestWithBackoff()
> 3, inside the function, if the attempt function (where the request is made
> to the prometheus server) returns a Recoverable error, then it will retry
> sending the request
> 4, when is a Recoverable error returned? one scenario is when the
> prometheus server returns a 5xx error
> 5, I think not every 5xx error is recoverable, and there is no other way
> to exit the for loop in sendWriteRequestWithBackoff().
> The agent keeps retrying, but every time it receives a 5xx from the
> server, so we lost all samples for hours until we restarted the agent.
>
> So my question is why a 5xx from the prometheus server is considered
> Recoverable? And I believe there should be a way to exit the loop, for
> example a maximum number of times to retry.
>
> It seems that the agent mode is not mature enough to work in production.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/099dd271-0797-4f07-8ce5-700f3d552317n%40googlegroups.com.

