It's difficult to make sense of what you're saying. Without seeing logs from both the agent and the server while this problem was occurring (e.g. `journalctl -eu prometheus`), it's hard to know what was really happening. You also need to say which exact versions of prometheus and the agent were running.
The fundamental issue here is: why should restarting the *agent* cause the prometheus *server* to stop returning 500 errors?

> So my question is why 5xx from the prometheus server is considered Recoverable?

It is by definition of the HTTP protocol: https://datatracker.ietf.org/doc/html/rfc2616#section-10.5

Actually it depends on exactly which 5xx status code you're talking about, but the common 500 and 503 errors are generally transient, meaning there was a problem at the server and the request may succeed if tried again later. If the prometheus server wanted to tell the client that the request was invalid and could never possibly succeed, then it would return a 4xx error instead.

> And I believe there should be a way to exit the loop, for example a maximum times to retry.

You are saying that you would prefer the agent to throw away data, rather than hold onto the data and try again later when it may succeed. In this situation, retrying is normally the correct thing to do.

You may have come across a bug where a *particular* piece of data being sent by the agent was causing a *particular* version of prometheus to fail with a 5xx internal error every time. The logs should make it clear whether this was happening.

On Friday 17 May 2024 at 10:02:49 UTC+1 koly li wrote:
> Hello all,
>
> Recently we found that our samples are all lost. After some investigation,
> we found:
> 1, we are using prometheus agent to send all data to prometheus server by
> remote write
> 2, the agent sample sending code is in storage/remote/queue_manager.go,
> the function is sendWriteRequestWithBackoff()
> 3, inside the function, if the attempt function (where the request is made
> to the prometheus server) returns a Recoverable error, then it will retry
> sending the request
> 4, when is a Recoverable error returned? one scenario is when the
> prometheus server returns a 5xx error
> 5, I think not every 5xx error is recoverable, and there is no other way
> to exit the for loop in sendWriteRequestWithBackoff().
> The agent keeps retrying, but every time it receives a 5xx from the
> server, so we lost all samples for hours until we restarted the agent.
>
> So my question is why a 5xx from the prometheus server is considered
> Recoverable? And I believe there should be a way to exit the loop, for
> example a maximum number of times to retry.
>
> It seems that the agent mode is not mature enough to work in production.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/099dd271-0797-4f07-8ce5-700f3d552317n%40googlegroups.com.

