Thanks a lot, Brian. I think I should create an issue.

On Monday, May 20, 2024 at 1:37:51 PM UTC+8 Brian Candler wrote:
> > server returned HTTP status 500 Internal Server Error: too old sample
>
> This is not the server failing to process the data; it's the client
> supplying invalid data. You found that this has been fixed to a 400.
>
> > server returned HTTP status 500 Internal Server Error: label name
> > "prometheus" is not unique: invalid sample
>
> I can't speak for the authors, but it looks to me like that should be a
> 400 as well.
>
> On Monday 20 May 2024 at 04:52:03 UTC+1 koly li wrote:
>
>> Sorry for my poor description. Here is the story:
>>
>> 1) At first, we were using Prometheus v2.47.
>>
>> Then we found that all metrics were missing, so we checked the Prometheus
>> log and the Prometheus agent log.
>>
>> Prometheus log (many lines like these):
>> ts=2024-04-19T20:33:26.485Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:26.539Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:26.626Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:26.775Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:27.042Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:27.552Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ....
>> ts=2024-04-22T03:00:03.327Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-22T03:00:08.394Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>>
>> Prometheus agent log:
>> ts=2024-04-19T20:33:26.517Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-19T20:34:29.714Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-19T20:35:30.113Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-19T20:36:30.478Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ....
>> ts=2024-04-22T02:56:57.281Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T02:57:57.624Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T02:58:57.943Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T02:59:58.267Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T03:00:58.733Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>>
>> Then we checked the code:
>> https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L77
>>
>> A "too old sample" error is treated as a 500.
>> And the agent keeps retrying (it exits the retry loop only when the
>> error is not recoverable, and a 500 is considered recoverable):
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/queue_manager.go#L1670
>>
>> > You may have come across a bug where a *particular* piece of data being
>> > sent by the agent was causing a *particular* version of prometheus to
>> > fail with a 5xx internal error every time. The logs should make it
>> > clear if this was happening.
>>
>> We guess there were one or more samples with too-old timestamps. Those
>> samples caused the Prometheus agent to retry forever (the agent kept
>> receiving 500s), which prevented new samples from being sent.
>>
>> Because the server does not log the offending samples, we cannot figure
>> out what those samples were. It logs samples only for a few specific
>> errors:
>> https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L132
>>
>> > The fundamental issue here is, why should restarting the *agent* cause
>> > the prometheus *server* to stop returning 500 errors?
>>
>> We restarted the agent 1~2 days after the problem occurred. The new data
>> did not contain too-old samples; that's why the 500 errors disappeared.
>> 2) Then we upgraded to v2.51.
>> The new version returns 400 for "too old sample":
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L72
>>
>> However, we encountered another 500.
>>
>> Prometheus agent log:
>> ts=2024-05-11T08:42:01.235Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:02.749Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"service\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:02.798Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"resourceType\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:02.851Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"namespace\" is not unique: invalid sample"
>>
>> We modified the code to log the offending samples, and then got this
>> Prometheus log:
>> ts=2024-05-11T08:42:26.603Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"resourceId\" is not unique: invalid sample" series="{__name__=\"ovs_vswitchd_interface_resets_total\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-rbac-proxy\", endpoint=\"ovs-metrics\", hostname=\"20230428-wangbo-dev16\", if_name=\"veth99fa6555\", instance=\"10.253.58.238:9983\", job=\"net-monitor-vnet-ovs\", namespace=\"net-monitor\", pod=\"net-monitor-vnet-ovs-66bdz\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceId=\"port-naqoi5tmkg5lrt0ubw\", resourceId=\"blb-74se39mqa9k3\", resourceType=\"Port\", resourceType=\"BLB\", rs_ip=\"10.0.0.3\", service=\"net-monitor-vnet-ovs\", service=\"net-monitor-vnet-ovs\", subnet_Id=\"snet-ztojflwrnd08xf5idw\", vip=\"11.4.2.64\", vpc_Id=\"vpc-6ss1uz29ctpfv0eqbj\", vpcid=\"11.4.2.64\"}" timestamp=1715349156000
>> ts=2024-05-11T08:42:26.603Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"resourceId\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"service\" is not unique: invalid sample" series="{__name__=\"rest_client_request_size_bytes_bucket\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-scheduler\", endpoint=\"https\", host=\"127.0.0.1:6443\", instance=\"10.253.58.236:10259\", job=\"scheduler\", le=\"262144\", namespace=\"kube-scheduler\", pod=\"kube-scheduler-20230428-wangbo-dev14\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceType=\"NETWORK-HOST\", service=\"scheduler\", service=\"net-monitor-vnet-ovs\", verb=\"POST\"}" timestamp=1715349164522
>> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"service\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"prometheus_replica\" is not unique: invalid sample" series="{__name__=\"workqueue_work_duration_seconds_sum\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", endpoint=\"https\", instance=\"21.100.10.52:8443\", job=\"metrics\", name=\"ResourceSyncController\", namespace=\"service-ca-operator\", pod=\"service-ca-operator-645cfdbfb6-rjr4z\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", prometheus_replica=\"prometheus-agent-0-0\", service=\"metrics\"}" timestamp=1715349271085
>> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"prometheus_replica\" is not unique: invalid sample"
>>
>> Currently we don't know why there are duplicated labels. But when the
>> server encounters duplicated labels, it returns 500. Then the agent
>> keeps retrying, which means new samples cannot be handled.
>>
>> We set external_labels in the prometheus-agent config:
>>
>> global:
>>   evaluation_interval: 30s
>>   scrape_interval: 5m
>>   scrape_timeout: 1m
>>   external_labels:
>>     clusterName: clustertest150
>>     clusterRegion: region0
>>     clusterZone: zone1
>>     prometheus: ccos-monitoring/agent-0
>>     prometheus_replica: prometheus-agent-0-0
>>   keep_dropped_targets: 1
>>
>> and the remote_write config:
>>
>> remote_write:
>> - url: https://prometheus-k8s-0.monitoring:9091/api/v1/write
>>   remote_timeout: 30s
>>   name: prometheus-k8s-0
>>   write_relabel_configs:
>>   - target_label: __tmp_cluster_id__
>>     replacement: 713c30cb-81c3-411d-b4dc-0c775a0f9564
>>     action: replace
>>   - regex: __tmp_cluster_id__
>>     action: labeldrop
>>   bearer_token: XDFSDF...
>>   tls_config:
>>     insecure_skip_verify: true
>>   queue_config:
>>     capacity: 10000
>>     min_shards: 1
>>     max_shards: 500
>>     max_samples_per_send: 2000
>>     batch_send_deadline: 10s
>>     min_backoff: 30ms
>>     max_backoff: 5s
>>     sample_age_limit: 5m
>>
>> > You are saying that you would prefer the agent to throw away data,
>> > rather than hold onto the data and try again later when it may succeed.
>> > In this situation, retrying is normally the correct thing to do.
>>
>> Yes, retrying is the normal solution. But there should be a maximum
>> number of retries. We noticed that the Prometheus agent puts the retry
>> count into a request header, but it seems that header is not used by the
>> server.
>>
>> The agent sets the retry count in the request header here:
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/client.go#L214
>>
>> Besides, if some samples in a request are incorrect and others are
>> correct, why doesn't the Prometheus server save the correct part and
>> drop the wrong part? It is more complicated because retries have to be
>> considered, but would it be possible to save partial data and return 206
>> when the maximum number of retries is reached?
>>
>> And should the Prometheus server log the samples for all kinds of
>> errors?
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L133
>>
>> On Friday, May 17, 2024 at 8:15:04 PM UTC+8 Brian Candler wrote:
>>
>>> It's difficult to make sense of what you're saying. Without seeing logs
>>> from both the agent and the server while this problem was occurring
>>> (e.g. `journalctl -eu prometheus`), it's hard to know what was really
>>> happening. Also you need to say what exact versions of prometheus and
>>> the agent were running.
>>>
>>> The fundamental issue here is, why should restarting the *agent* cause
>>> the prometheus *server* to stop returning 500 errors?
>>>
>>> > So my question is why 5xx from the prometheus server is considered
>>> > Recoverable?
>>> It is by definition of the HTTP protocol:
>>> https://datatracker.ietf.org/doc/html/rfc2616#section-10.5
>>>
>>> Actually it depends on exactly which 5xx error code you're talking
>>> about, but common 500 and 503 errors are generally transient, meaning
>>> there was a problem at the server and the request may succeed if tried
>>> again later. If the prometheus server wanted to tell the client that
>>> the request was invalid and could never possibly succeed, then it would
>>> return a 4xx error.
>>>
>>> > And I believe there should be a way to exit the loop, for example a
>>> > maximum number of retries.
>>>
>>> You are saying that you would prefer the agent to throw away data,
>>> rather than hold onto the data and try again later when it may succeed.
>>> In this situation, retrying is normally the correct thing to do.
>>>
>>> You may have come across a bug where a *particular* piece of data being
>>> sent by the agent was causing a *particular* version of prometheus to
>>> fail with a 5xx internal error every time. The logs should make it
>>> clear if this was happening.
>>>
>>> On Friday 17 May 2024 at 10:02:49 UTC+1 koly li wrote:
>>>
>>>> Hello all,
>>>>
>>>> Recently we found that all of our samples were lost. After some
>>>> investigation, we found:
>>>> 1) We are using the Prometheus agent to send all data to the
>>>> Prometheus server via remote write.
>>>> 2) The agent's sample-sending code is in
>>>> storage/remote/queue_manager.go, in the function
>>>> sendWriteRequestWithBackoff().
>>>> 3) Inside that function, if attempt() (the function that makes the
>>>> request to the Prometheus server) returns a recoverable error, it
>>>> retries sending the request.
>>>> 4) When is a recoverable error returned? One scenario is when the
>>>> Prometheus server returns a 5xx error.
>>>> 5) I think not every 5xx error is recoverable, and there is no other
>>>> way to exit the for loop in sendWriteRequestWithBackoff().
>>>> The agent keeps retrying, but every time it receives a 5xx from the
>>>> server, so we lost all samples for hours until we restarted the agent.
>>>>
>>>> So my question is: why is a 5xx from the Prometheus server considered
>>>> recoverable? And I believe there should be a way to exit the loop, for
>>>> example a maximum number of retries.
>>>>
>>>> It seems that the agent mode is not mature enough to work in
>>>> production.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/5661de1d-21c5-486c-9177-fa346ebdc922n%40googlegroups.com.

