> server returned HTTP status 500 Internal Server Error: too old sample

This is not the server failing to process the data; it's the client supplying invalid data. You found that this has since been fixed to return a 400.
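For anyone skimming the thread, that is the distinction at stake: errors caused by the payload itself should come back as a 4xx so the sender stops retrying, while genuine server-side failures stay 5xx so the sender retries later. A minimal, self-contained Go sketch of that classification; the sentinel errors and `statusForAppendError` are stand-ins of my own, not the actual Prometheus handler code:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Stand-in sentinel errors; the real handler inspects errors from the TSDB appender.
var (
	errTooOldSample   = errors.New("too old sample")
	errDuplicateLabel = errors.New("label name is not unique: invalid sample")
)

// statusForAppendError classifies an append error the way the thread argues it
// should be classified: bad client data -> 400 (do not retry), anything the
// server cannot explain -> 500 (client may retry later).
func statusForAppendError(err error) int {
	switch {
	case err == nil:
		return http.StatusOK // any 2xx tells the sender the batch was accepted
	case errors.Is(err, errTooOldSample), errors.Is(err, errDuplicateLabel):
		return http.StatusBadRequest // resending the same payload can never succeed
	default:
		return http.StatusInternalServerError // genuine server-side trouble; worth retrying
	}
}

func main() {
	for _, err := range []error{nil, errTooOldSample, errors.New("wal: disk full")} {
		fmt.Println(statusForAppendError(err), err)
	}
}
```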
> server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample

I can't speak for the authors, but it looks to me like that should be a 400 as well.

On Monday 20 May 2024 at 04:52:03 UTC+1 koly li wrote:

> Sorry for my poor description. Here is the story:
>
> 1) At first, we were using Prometheus v2.47.
>
> Then we found that all metrics were missing, so we checked the Prometheus log and the Prometheus agent log.
>
> Prometheus log (lots of lines):
>
> ts=2024-04-19T20:33:26.485Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:26.539Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:26.626Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:26.775Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:27.042Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:27.552Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ....
> ts=2024-04-22T03:00:03.327Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-22T03:00:08.394Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>
> Prometheus agent log:
>
> ts=2024-04-19T20:33:26.517Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-19T20:34:29.714Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-19T20:35:30.113Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-19T20:36:30.478Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ....
> ts=2024-04-22T02:56:57.281Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-22T02:57:57.624Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-22T02:58:57.943Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-22T02:59:58.267Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-22T03:00:58.733Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>
> Then we checked the code:
> https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L77
>
> The "too old sample" error is treated as a 500. And the agent keeps retrying (it exits only when the error is not recoverable, and a 500 is considered recoverable):
> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/queue_manager.go#L1670
>
>> You may have come across a bug where a *particular* piece of data being sent by the agent was causing a *particular* version of prometheus to fail with a 5xx internal error every time. The logs should make it clear if this was happening.
>
> We guess there were one or more samples with too-old timestamps that caused the problem. One or more samples with a "too old" timestamp cause the Prometheus agent to retry forever (the agent receives a 500), which prevents new samples from being sent.
>
> Because there is no logging of the incorrect samples, we cannot figure out what those samples are. The server logs samples only for certain errors:
> https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L132
>
>> The fundamental issue here is, why should restarting the *agent* cause the prometheus *server* to stop returning 500 errors?
>
> We restarted the agent 1-2 days after the problem occurred. The new data did not contain too-old samples; that is why the 500 errors disappeared.
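That matches the code linked above: the send loop gives up only on errors it does not consider recoverable, and (per the thread) an HTTP 5xx is treated as recoverable. Here is a stripped-down sketch of that shape, with illustrative names of my own rather than the real sendWriteRequestWithBackoff, showing why one poison batch that always draws a 500 stalls everything queued behind it:

```go
package main

import (
	"fmt"
	"time"
)

// recoverableError marks failures worth retrying, e.g. an HTTP 5xx response.
type recoverableError struct{ error }

// sendWithBackoff keeps resending the *same* batch until it either succeeds or
// fails with a non-recoverable error. There is no attempt limit, so a server
// that answers 500 for this batch every time blocks the shard indefinitely.
func sendWithBackoff(send func() error, minBackoff, maxBackoff time.Duration) error {
	backoff := minBackoff
	for {
		err := send()
		if err == nil {
			return nil
		}
		if _, ok := err.(recoverableError); !ok {
			return err // e.g. a 4xx: give up on the batch and move on
		}
		fmt.Printf("retrying in %v: %v\n", backoff, err)
		time.Sleep(backoff)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	attempts := 0
	_ = sendWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return recoverableError{fmt.Errorf("server returned HTTP status 500 Internal Server Error")}
		}
		return nil // succeeds on the third attempt
	}, 30*time.Millisecond, 5*time.Second)
	fmt.Println("batch sent after", attempts, "attempts")
}
```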
> 2) Then we upgraded to v2.51.
>
> The new version returns 400 for "too old sample":
> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L72
>
> However, we encountered another 500.
>
> Prometheus agent log:
>
> ts=2024-05-11T08:42:01.235Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample"
> ts=2024-05-11T08:42:02.749Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"service\" is not unique: invalid sample"
> ts=2024-05-11T08:42:02.798Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"resourceType\" is not unique: invalid sample"
> ts=2024-05-11T08:42:02.851Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"namespace\" is not unique: invalid sample"
>
> We modified the code to log the offending samples, and then got this Prometheus log:
>
> ts=2024-05-11T08:42:26.603Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"resourceId\" is not unique: invalid sample" series="{__name__=\"ovs_vswitchd_interface_resets_total\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-rbac-proxy\", endpoint=\"ovs-metrics\", hostname=\"20230428-wangbo-dev16\", if_name=\"veth99fa6555\", instance=\"10.253.58.238:9983\", job=\"net-monitor-vnet-ovs\", namespace=\"net-monitor\", pod=\"net-monitor-vnet-ovs-66bdz\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceId=\"port-naqoi5tmkg5lrt0ubw\", resourceId=\"blb-74se39mqa9k3\", resourceType=\"Port\", resourceType=\"BLB\", rs_ip=\"10.0.0.3\", service=\"net-monitor-vnet-ovs\", service=\"net-monitor-vnet-ovs\", subnet_Id=\"snet-ztojflwrnd08xf5idw\", vip=\"11.4.2.64\", vpc_Id=\"vpc-6ss1uz29ctpfv0eqbj\", vpcid=\"11.4.2.64\"}" timestamp=1715349156000
> ts=2024-05-11T08:42:26.603Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"resourceId\" is not unique: invalid sample"
> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"service\" is not unique: invalid sample" series="{__name__=\"rest_client_request_size_bytes_bucket\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-scheduler\", endpoint=\"https\", host=\"127.0.0.1:6443\", instance=\"10.253.58.236:10259\", job=\"scheduler\", le=\"262144\", namespace=\"kube-scheduler\", pod=\"kube-scheduler-20230428-wangbo-dev14\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceType=\"NETWORK-HOST\", service=\"scheduler\", service=\"net-monitor-vnet-ovs\", verb=\"POST\"}" timestamp=1715349164522
> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"service\" is not unique: invalid sample"
> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"prometheus_replica\" is not unique: invalid sample" series="{__name__=\"workqueue_work_duration_seconds_sum\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", endpoint=\"https\", instance=\"21.100.10.52:8443\", job=\"metrics\", name=\"ResourceSyncController\", namespace=\"service-ca-operator\", pod=\"service-ca-operator-645cfdbfb6-rjr4z\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", prometheus_replica=\"prometheus-agent-0-0\", service=\"metrics\"}" timestamp=1715349271085
> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"prometheus_replica\" is not unique: invalid sample"
>
> Currently we don't know why there are duplicated labels. But when the server encounters duplicated labels, it returns a 500. Then the agent keeps retrying, which means new samples cannot be handled.
>
> We set external_labels in the prometheus-agent config:
>
>   global:
>     evaluation_interval: 30s
>     scrape_interval: 5m
>     scrape_timeout: 1m
>     external_labels:
>       clusterName: clustertest150
>       clusterRegion: region0
>       clusterZone: zone1
>       prometheus: ccos-monitoring/agent-0
>       prometheus_replica: prometheus-agent-0-0
>     keep_dropped_targets: 1
>
> and the remote write config:
>
>   remote_write:
>   - url: https://prometheus-k8s-0.monitoring:9091/api/v1/write
>     remote_timeout: 30s
>     name: prometheus-k8s-0
>     write_relabel_configs:
>     - target_label: __tmp_cluster_id__
>       replacement: 713c30cb-81c3-411d-b4dc-0c775a0f9564
>       action: replace
>     - regex: __tmp_cluster_id__
>       action: labeldrop
>     bearer_token: XDFSDF...
>     tls_config:
>       insecure_skip_verify: true
>     queue_config:
>       capacity: 10000
>       min_shards: 1
>       max_shards: 500
>       max_samples_per_send: 2000
>       batch_send_deadline: 10s
>       min_backoff: 30ms
>       max_backoff: 5s
>       sample_age_limit: 5m
>
>> You are saying that you would prefer the agent to throw away data, rather than hold onto the data and try again later when it may succeed. In this situation, retrying is normally the correct thing to do.
>
> Yes, retrying is the normal solution. But there should be a maximum number of retries. We noticed that the Prometheus agent sets the retry number in a request header, but it seems the header is not used by the server.
>
> The agent sets the retry number in the request header here:
> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/client.go#L214
>
> Besides, if some samples in the same request are incorrect and others are correct, why doesn't the Prometheus server save the correct part and drop the wrong part? It is more complicated because retries have to be considered, but is it possible to save the partial data and return a 206 once the maximum number of retries is reached?
>
> And should the Prometheus server log the samples for all kinds of errors?
> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L133
On Friday, May 17, 2024 at 8:15:04 PM UTC+8 Brian Candler wrote:

>> It's difficult to make sense of what you're saying. Without seeing logs from both the agent and the server while this problem was occurring (e.g. `journalctl -eu prometheus`), it's hard to know what was really happening. Also, you need to say what exact versions of Prometheus and the agent were running.
>>
>> The fundamental issue here is, why should restarting the *agent* cause the prometheus *server* to stop returning 500 errors?
>>
>> > So my question is why 5xx from the prometheus server is considered Recoverable?
>>
>> It is by definition of the HTTP protocol:
>> https://datatracker.ietf.org/doc/html/rfc2616#section-10.5
>>
>> Actually it depends on exactly which 5xx error code you're talking about, but common 500 and 503 errors are generally transient, meaning there was a problem at the server and the request may succeed if tried again later. If the Prometheus server wanted to tell the client that the request was invalid and could never possibly succeed, then it would return a 4xx error.
>>
>> > And I believe there should be a way to exit the loop, for example a maximum number of retries.
>>
>> You are saying that you would prefer the agent to throw away data, rather than hold onto the data and try again later when it may succeed. In this situation, retrying is normally the correct thing to do.
>>
>> You may have come across a bug where a *particular* piece of data being sent by the agent was causing a *particular* version of prometheus to fail with a 5xx internal error every time. The logs should make it clear if this was happening.
>>
>> On Friday 17 May 2024 at 10:02:49 UTC+1 koly li wrote:
>>
>>> Hello all,
>>>
>>> Recently we found that all of our samples were lost. After some investigation, we found:
>>>
>>> 1. We are using the Prometheus agent to send all data to a Prometheus server via remote write.
>>> 2. The agent's sample-sending code is in storage/remote/queue_manager.go, in the function sendWriteRequestWithBackoff().
>>> 3. Inside that function, if attempt() (the function where the request is made to the Prometheus server) returns a recoverable error, it retries sending the request.
>>> 4. When is a recoverable error returned? One scenario is when the Prometheus server returns a 5xx error.
>>> 5. I think not every 5xx error is recoverable, and there is no other way to exit the for loop in sendWriteRequestWithBackoff(). The agent kept retrying but received a 5xx from the server every time, so we lost all samples for hours until we restarted the agent.
>>>
>>> So my question is: why is a 5xx from the Prometheus server considered Recoverable? And I believe there should be a way to exit the loop, for example a maximum number of retries.
>>>
>>> It seems that the agent mode is not mature enough to work in production.
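For completeness, the "maximum number of retries" idea from the original question would look roughly like the sketch below. To be clear, this illustrates the proposal being discussed, not current agent behaviour; capping attempts means deliberately dropping the batch once the cap is reached, which is exactly the data-loss trade-off debated above.

```go
package main

import (
	"fmt"
	"time"
)

// sendWithBoundedRetries is a sketch of the proposal in the question above:
// retry a failed batch at most maxAttempts times, then drop it so the queue
// behind it can drain. The dropped samples are lost for good.
func sendWithBoundedRetries(send func() error, maxAttempts int, backoff time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = send(); err == nil {
			return nil
		}
		fmt.Printf("attempt %d/%d failed: %v\n", attempt, maxAttempts, err)
		time.Sleep(backoff)
	}
	return fmt.Errorf("dropping batch after %d attempts: %w", maxAttempts, err)
}

func main() {
	err := sendWithBoundedRetries(func() error {
		return fmt.Errorf("server returned HTTP status 500 Internal Server Error")
	}, 3, 100*time.Millisecond)
	fmt.Println(err)
}
```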

