Sorry for my poor description. Here is the story:

1) At first, we were using Prometheus v2.47.

Then we found that all metrics were missing, so we checked the Prometheus 
server log and the Prometheus agent log:

prometheus log (lots of lines):
ts=2024-04-19T20:33:26.485Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.539Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.626Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:26.775Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:27.042Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-19T20:33:27.552Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
....
ts=2024-04-22T03:00:03.327Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
ts=2024-04-22T03:00:08.394Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"

prometheus agent logs:
ts=2024-04-19T20:33:26.517Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:34:29.714Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:35:30.113Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-19T20:36:30.478Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
....
ts=2024-04-22T02:56:57.281Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T02:57:57.624Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T02:58:57.943Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T02:59:58.267Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
ts=2024-04-22T03:00:58.733Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"

Then we checked the code:
https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L77

The "too old sample" error is returned as a 500. And the agent keeps 
retrying (it gives up only when the error is not recoverable, and a 500 is 
considered recoverable):
https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/queue_manager.go#L1670

> You may have come across a bug where a *particular* piece of data being 
> sent by the agent was causing a *particular* version of prometheus to fail 
> with a 5xx internal error every time. The logs should make it clear if this 
> was happening.

We suspect that one or more samples with too-old timestamps caused the 
problem. The server returned 500 for them, so the agent retried the same 
batch forever, which prevented any new samples from being sent.

Because the server does not log the offending samples, we cannot figure out 
what these samples are. It logs the series only for a few specific errors:
https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L132
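
Roughly, the handler's error handling looks like the sketch below (a 
simplified illustration of the linked code, not a verbatim copy; the error 
variables are stand-ins): the series is logged only for the explicitly 
matched errors, while everything else falls through to a bare 500 with no 
series information.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Stand-ins for the storage errors the handler matches explicitly.
var (
	errOutOfOrderSample = errors.New("out of order sample")
	errDuplicateSample  = errors.New("duplicate sample for timestamp")
	errTooOldSample     = errors.New("too old sample")
)

// statusFor sketches how a write handler can map append errors to HTTP
// status codes: only the recognised error types get the offending series
// logged, and anything unrecognised becomes a 500.
func statusFor(err error, series string) int {
	switch {
	case err == nil:
		return http.StatusNoContent
	case errors.Is(err, errOutOfOrderSample), errors.Is(err, errDuplicateSample):
		// Only these branches log which series was rejected.
		fmt.Printf("rejected series: %s: %v\n", series, err)
		return http.StatusBadRequest
	default:
		// "too old sample" lands here in v2.47: no series logged, 500 returned.
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(statusFor(errTooOldSample, `{__name__="up"}`))
}
```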

> The fundamental issue here is, why should restarting the *agent* cause 
> the prometheus *server* to stop returning 500 errors?

We restarted the agent 1-2 days after the problem occurred. The fresh data 
no longer contained too-old samples; that is why the 500 errors disappeared.

2) Then we upgraded to v2.51.

The new version returns 400 for "too old sample":
https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L72

However, we then encountered another 500.

prometheus agent log:
ts=2024-05-11T08:42:01.235Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample"
ts=2024-05-11T08:42:02.749Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"service\" is not unique: invalid sample"
ts=2024-05-11T08:42:02.798Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"resourceType\" is not unique: invalid sample"
ts=2024-05-11T08:42:02.851Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"namespace\" is not unique: invalid sample"

We modified the code to log the offending samples, and then got this 
Prometheus server log:
ts=2024-05-11T08:42:26.603Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"resourceId\" is not unique: invalid sample" series="{__name__=\"ovs_vswitchd_interface_resets_total\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-rbac-proxy\", endpoint=\"ovs-metrics\", hostname=\"20230428-wangbo-dev16\", if_name=\"veth99fa6555\", instance=\"10.253.58.238:9983\", job=\"net-monitor-vnet-ovs\", namespace=\"net-monitor\", pod=\"net-monitor-vnet-ovs-66bdz\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceId=\"port-naqoi5tmkg5lrt0ubw\", resourceId=\"blb-74se39mqa9k3\", resourceType=\"Port\", resourceType=\"BLB\", rs_ip=\"10.0.0.3\", service=\"net-monitor-vnet-ovs\", service=\"net-monitor-vnet-ovs\", subnet_Id=\"snet-ztojflwrnd08xf5idw\", vip=\"11.4.2.64\", vpc_Id=\"vpc-6ss1uz29ctpfv0eqbj\", vpcid=\"11.4.2.64\"}" timestamp=1715349156000
ts=2024-05-11T08:42:26.603Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"resourceId\" is not unique: invalid sample"
ts=2024-05-11T08:42:26.967Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"service\" is not unique: invalid sample" series="{__name__=\"rest_client_request_size_bytes_bucket\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-scheduler\", endpoint=\"https\", host=\"127.0.0.1:6443\", instance=\"10.253.58.236:10259\", job=\"scheduler\", le=\"262144\", namespace=\"kube-scheduler\", pod=\"kube-scheduler-20230428-wangbo-dev14\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceType=\"NETWORK-HOST\", service=\"scheduler\", service=\"net-monitor-vnet-ovs\", verb=\"POST\"}" timestamp=1715349164522
ts=2024-05-11T08:42:26.967Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"service\" is not unique: invalid sample"
ts=2024-05-11T08:42:27.091Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"prometheus_replica\" is not unique: invalid sample" series="{__name__=\"workqueue_work_duration_seconds_sum\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", endpoint=\"https\", instance=\"21.100.10.52:8443\", job=\"metrics\", name=\"ResourceSyncController\", namespace=\"service-ca-operator\", pod=\"service-ca-operator-645cfdbfb6-rjr4z\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", prometheus_replica=\"prometheus-agent-0-0\", service=\"metrics\"}" timestamp=1715349271085
ts=2024-05-11T08:42:27.091Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"prometheus_replica\" is not unique: invalid sample"

We currently don't know why there are duplicated labels. But when the 
server encounters duplicated labels, it returns 500, and the agent keeps 
retrying, which means new samples cannot be handled.
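
The check that trips here is easy to reproduce: with labels sorted by name, 
a duplicate shows up as two adjacent entries with the same name. A minimal 
sketch (our own illustration, not the server's actual validation code):

```go
package main

import (
	"fmt"
	"sort"
)

// Label is a minimal stand-in for one Prometheus label pair.
type Label struct{ Name, Value string }

// firstDuplicateName sorts the labels by name (in place) and returns the
// first name that appears more than once, as in the series from our logs
// where resourceId (or service, prometheus_replica, ...) occurred twice.
func firstDuplicateName(ls []Label) (string, bool) {
	sort.Slice(ls, func(i, j int) bool { return ls[i].Name < ls[j].Name })
	for i := 1; i < len(ls); i++ {
		if ls[i].Name == ls[i-1].Name {
			return ls[i].Name, true
		}
	}
	return "", false
}

func main() {
	series := []Label{
		{"__name__", "ovs_vswitchd_interface_resets_total"},
		{"resourceId", "port-naqoi5tmkg5lrt0ubw"},
		{"resourceId", "blb-74se39mqa9k3"},
	}
	if name, dup := firstDuplicateName(series); dup {
		fmt.Printf("label name %q is not unique: invalid sample\n", name)
	}
}
```

The validation itself is cheap; what hurts is that its failure is reported 
as a retryable 500 rather than a permanent 400.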

We set external_labels in the prometheus-agent config:
global:
  evaluation_interval: 30s
  scrape_interval: 5m
  scrape_timeout: 1m
  external_labels:
    clusterName: clustertest150
    clusterRegion: region0
    clusterZone: zone1
    prometheus: ccos-monitoring/agent-0
    prometheus_replica: prometheus-agent-0-0
  keep_dropped_targets: 1

and the remote write config:
remote_write:
- url: https://prometheus-k8s-0.monitoring:9091/api/v1/write
  remote_timeout: 30s
  name: prometheus-k8s-0
  write_relabel_configs:
  - target_label: __tmp_cluster_id__
    replacement: 713c30cb-81c3-411d-b4dc-0c775a0f9564
    action: replace
  - regex: __tmp_cluster_id__
    action: labeldrop
  bearer_token: XDFSDF...
  tls_config:
    insecure_skip_verify: true
  queue_config:
    capacity: 10000
    min_shards: 1
    max_shards: 500
    max_samples_per_send: 2000
    batch_send_deadline: 10s
    min_backoff: 30ms
    max_backoff: 5s
    sample_age_limit: 5m

> You are saying that you would prefer the agent to throw away data, rather 
> than hold onto the data and try again later when it may succeed. In this 
> situation, retrying is normally the correct thing to do.

Yes, retrying is the normal solution, but there should be a maximum number 
of retries. We noticed that the Prometheus agent puts the retry attempt 
number in a request header, but it seems that header is not used by the 
server.

prometheus-agent sets the retry attempt number in the request header here:
https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/client.go#L214

Besides, if some samples in a request are invalid while others are valid, 
why doesn't the Prometheus server save the valid part and drop the invalid 
part? It is more complicated because retries have to be considered, but 
would it be possible to save partial data and return a 206 once the maximum 
number of retries is reached?
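
The partial-write idea could be sketched like this (our own illustration; 
the Sample type and its Valid field stand in for whatever timestamp/label 
validation the server applies per sample):

```go
package main

import (
	"fmt"
)

// Sample is a minimal stand-in for one remote-write sample.
type Sample struct {
	Series string
	Valid  bool // stand-in for "passes timestamp/label validation"
}

// appendPartial stores the valid samples and returns the invalid ones,
// sketching the "save the correct part, drop the wrong part" idea. A server
// doing this could answer with a partial-success status (such as the 206 we
// suggest) instead of failing the whole batch with a 500.
func appendPartial(batch []Sample) (stored, rejected []Sample) {
	for _, s := range batch {
		if s.Valid {
			stored = append(stored, s)
		} else {
			rejected = append(rejected, s)
		}
	}
	return stored, rejected
}

func main() {
	batch := []Sample{{"up", true}, {"bad_series", false}, {"go_goroutines", true}}
	stored, rejected := appendPartial(batch)
	fmt.Printf("stored %d, rejected %d\n", len(stored), len(rejected))
}
```

The hard part, as noted above, is making this interact sanely with retries 
(a retried batch must not double-count the samples already stored).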

And should the Prometheus server log the offending samples for all kinds of errors?
https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L133

On Friday, May 17, 2024 at 8:15:04 PM UTC+8 Brian Candler wrote:

> It's difficult to make sense of what you're saying. Without seeing logs 
> from both the agent and the server while this problem was occurring (e.g. 
> `journalctl -eu prometheus`), it's hard to know what was really happening. 
> Also you need to say what exact versions of prometheus and the agent were 
> running.
>
> The fundamental issue here is, why should restarting the *agent* cause the 
> prometheus *server* to stop returning 500 errors?
>
>
> > So my question is why 5xx from the prometheus server is considered 
> Recoverable?
>
> It is by definition of the HTTP protocol: 
> https://datatracker.ietf.org/doc/html/rfc2616#section-10.5
>
> Actually it depends on exactly which 5xx error code you're talking about, 
> but common 500 and 503 errors are generally transient, meaning there was a 
> problem at the server and the request may succeed if tried again later.  If 
> the prometheus server wanted to tell the client that the request was 
> invalid and could never possibly succeed, then it would return a 4xx error.
>
> > And I believe there should be a way to exit the loop, for example a 
> maximum times to retry.
>
> You are saying that you would prefer the agent to throw away data, rather 
> than hold onto the data and try again later when it may succeed. In this 
> situation, retrying is normally the correct thing to do.
>
> You may have come across a bug where a *particular* piece of data being 
> sent by the agent was causing a *particular* version of prometheus to fail 
> with a 5xx internal error every time. The logs should make it clear if this 
> was happening.
>
> On Friday 17 May 2024 at 10:02:49 UTC+1 koly li wrote:
>
>> Hello all,
>>
>> Recently we found that our samples are all lost. After some 
>> investigation, we found:
>> 1, we are using prometheus agent to send all data to prometheus server by 
>> remote write
>> 2, the agent sample sending code is in storage\remote\queue_manager.go, 
>> the function is sendWriteRequestWithBackoff()
>> 3, inside the function, if attempt(the function where request is made to 
>> prometheus server) function returns an Recoverable Error, then it will 
>> retry sending the request
>> 4, when a Recoverable error is returned? one scenario is the prometheus 
>> server returned 5xx error
>> 5, I think not every 5xx error is recoverable, and there is no other way 
>> to exit the for loop in sendWriteRequestWithBackoff(). The agent keeps 
>> retrying but every time it receives an 5xx from the server. so we lost all 
>> samples for hours until we restart the agent
>>
>> So my question is why 5xx from the prometheus server is considered 
>> Recoverable? And I believe there should be a way to exit the loop, for 
>> example a maximum times to retry.
>>
>> It seems that the agent mode is not mature enough to work in production.
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.