Thanks a lot, Brian. I think I should create an issue.

On Monday, May 20, 2024 at 1:37:51 PM UTC+8 Brian Candler wrote:
> > server returned HTTP status 500 Internal Server Error: too old sample
>
> This is not the server failing to process the data; it's the client
> supplying invalid data. You found that this has been fixed to a 400.
>
> > server returned HTTP status 500 Internal Server Error: label name
> > "prometheus" is not unique: invalid sample
>
> I can't speak for the authors, but it looks to me like that should be a
> 400 as well.
>
> On Monday 20 May 2024 at 04:52:03 UTC+1 koly li wrote:
>
>> Sorry for my poor description. Here is the story:
>>
>> 1) At first, we were using Prometheus v2.47.
>>
>> Then we found that all metrics were missing, so we checked the Prometheus
>> log and the Prometheus agent log.
>>
>> Prometheus log (many lines like these):
>> ts=2024-04-19T20:33:26.485Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:26.539Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:26.626Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:26.775Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:27.042Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-19T20:33:27.552Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ....
>> ts=2024-04-22T03:00:03.327Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>> ts=2024-04-22T03:00:08.394Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>>
>> Prometheus agent log:
>> ts=2024-04-19T20:33:26.517Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-19T20:34:29.714Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-19T20:35:30.113Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-19T20:36:30.478Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ....
>> ts=2024-04-22T02:56:57.281Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T02:57:57.624Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T02:58:57.943Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T02:59:58.267Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>> ts=2024-04-22T03:00:58.733Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>>
>> Then we checked the code:
>> https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L77
>>
>> A "too old sample" error is treated as a 500.
>> And the agent keeps retrying (it exits the retry loop only when the
>> error is not recoverable, and a 500 is considered recoverable):
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/queue_manager.go#L1670
>>
>> > You may have come across a bug where a *particular* piece of data being
>> > sent by the agent was causing a *particular* version of prometheus to
>> > fail with a 5xx internal error every time. The logs should make it
>> > clear if this was happening.
>>
>> We guess there were one or more samples with too-old timestamps. Those
>> samples caused the Prometheus agent to retry forever (the agent kept
>> receiving 500s), which prevented new samples from being sent.
>>
>> Because the server does not log the offending samples, we cannot figure
>> out what those samples were. It logs samples only for a few specific
>> errors:
>> https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L132
>>
>> > The fundamental issue here is, why should restarting the *agent* cause
>> > the prometheus *server* to stop returning 500 errors?
>>
>> We restarted the agent 1~2 days after the problem occurred. The new data
>> did not contain too-old samples; that's why the 500 errors disappeared.
>> 2) Then we upgraded to v2.51.
>> The new version returns 400 for "too old sample":
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L72
>>
>> However, we encountered another 500.
>>
>> Prometheus agent log:
>> ts=2024-05-11T08:42:01.235Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:02.749Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"service\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:02.798Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"resourceType\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:02.851Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"namespace\" is not unique: invalid sample"
>>
>> We modified the code to log the offending samples, and then got this
>> Prometheus log:
>> ts=2024-05-11T08:42:26.603Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"resourceId\" is not unique: invalid sample" series="{__name__=\"ovs_vswitchd_interface_resets_total\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-rbac-proxy\", endpoint=\"ovs-metrics\", hostname=\"20230428-wangbo-dev16\", if_name=\"veth99fa6555\", instance=\"10.253.58.238:9983\", job=\"net-monitor-vnet-ovs\", namespace=\"net-monitor\", pod=\"net-monitor-vnet-ovs-66bdz\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceId=\"port-naqoi5tmkg5lrt0ubw\", resourceId=\"blb-74se39mqa9k3\", resourceType=\"Port\", resourceType=\"BLB\", rs_ip=\"10.0.0.3\", service=\"net-monitor-vnet-ovs\", service=\"net-monitor-vnet-ovs\", subnet_Id=\"snet-ztojflwrnd08xf5idw\", vip=\"11.4.2.64\", vpc_Id=\"vpc-6ss1uz29ctpfv0eqbj\", vpcid=\"11.4.2.64\"}" timestamp=1715349156000
>> ts=2024-05-11T08:42:26.603Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"resourceId\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"service\" is not unique: invalid sample" series="{__name__=\"rest_client_request_size_bytes_bucket\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-scheduler\", endpoint=\"https\", host=\"127.0.0.1:6443\", instance=\"10.253.58.236:10259\", job=\"scheduler\", le=\"262144\", namespace=\"kube-scheduler\", pod=\"kube-scheduler-20230428-wangbo-dev14\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceType=\"NETWORK-HOST\", service=\"scheduler\", service=\"net-monitor-vnet-ovs\", verb=\"POST\"}" timestamp=1715349164522
>> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"service\" is not unique: invalid sample"
>> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"prometheus_replica\" is not unique: invalid sample" series="{__name__=\"workqueue_work_duration_seconds_sum\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", endpoint=\"https\", instance=\"21.100.10.52:8443\", job=\"metrics\", name=\"ResourceSyncController\", namespace=\"service-ca-operator\", pod=\"service-ca-operator-645cfdbfb6-rjr4z\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", prometheus_replica=\"prometheus-agent-0-0\", service=\"metrics\"}" timestamp=1715349271085
>> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"prometheus_replica\" is not unique: invalid sample"
>>
>> Currently we don't know why there are duplicated labels. But when the
>> server encounters duplicated labels, it returns 500. Then the agent
>> keeps retrying, which means new samples cannot be handled.
>>
>> We set external_labels in the prometheus-agent config:
>>
>> global:
>>   evaluation_interval: 30s
>>   scrape_interval: 5m
>>   scrape_timeout: 1m
>>   external_labels:
>>     clusterName: clustertest150
>>     clusterRegion: region0
>>     clusterZone: zone1
>>     prometheus: ccos-monitoring/agent-0
>>     prometheus_replica: prometheus-agent-0-0
>>   keep_dropped_targets: 1
>>
>> and the remote_write config:
>>
>> remote_write:
>> - url: https://prometheus-k8s-0.monitoring:9091/api/v1/write
>>   remote_timeout: 30s
>>   name: prometheus-k8s-0
>>   write_relabel_configs:
>>   - target_label: __tmp_cluster_id__
>>     replacement: 713c30cb-81c3-411d-b4dc-0c775a0f9564
>>     action: replace
>>   - regex: __tmp_cluster_id__
>>     action: labeldrop
>>   bearer_token: XDFSDF...
>>   tls_config:
>>     insecure_skip_verify: true
>>   queue_config:
>>     capacity: 10000
>>     min_shards: 1
>>     max_shards: 500
>>     max_samples_per_send: 2000
>>     batch_send_deadline: 10s
>>     min_backoff: 30ms
>>     max_backoff: 5s
>>     sample_age_limit: 5m
>>
>> > You are saying that you would prefer the agent to throw away data,
>> > rather than hold onto the data and try again later when it may succeed.
>> > In this situation, retrying is normally the correct thing to do.
>>
>> Yes, retrying is the normal solution. But there should be a maximum
>> number of retries. We noticed that the Prometheus agent puts the retry
>> count into a request header, but it seems that header is not used by the
>> server.
>>
>> The agent sets the retry count in the request header here:
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/client.go#L214
>>
>> Besides, if some samples in a request are incorrect and others are
>> correct, why doesn't the Prometheus server save the correct part and
>> drop the wrong part? It is more complicated because retries have to be
>> considered, but would it be possible to save partial data and return 206
>> when the maximum number of retries is reached?
>>
>> And should the Prometheus server log the samples for all kinds of
>> errors?
>> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L133
>>
>> On Friday, May 17, 2024 at 8:15:04 PM UTC+8 Brian Candler wrote:
>>
>>> It's difficult to make sense of what you're saying. Without seeing logs
>>> from both the agent and the server while this problem was occurring
>>> (e.g. `journalctl -eu prometheus`), it's hard to know what was really
>>> happening. Also you need to say what exact versions of prometheus and
>>> the agent were running.
>>>
>>> The fundamental issue here is, why should restarting the *agent* cause
>>> the prometheus *server* to stop returning 500 errors?
>>>
>>> > So my question is why 5xx from the prometheus server is considered
>>> > Recoverable?
>>> It is by definition of the HTTP protocol:
>>> https://datatracker.ietf.org/doc/html/rfc2616#section-10.5
>>>
>>> Actually it depends on exactly which 5xx error code you're talking
>>> about, but common 500 and 503 errors are generally transient, meaning
>>> there was a problem at the server and the request may succeed if tried
>>> again later. If the prometheus server wanted to tell the client that
>>> the request was invalid and could never possibly succeed, then it would
>>> return a 4xx error.
>>>
>>> > And I believe there should be a way to exit the loop, for example a
>>> > maximum number of retries.
>>>
>>> You are saying that you would prefer the agent to throw away data,
>>> rather than hold onto the data and try again later when it may succeed.
>>> In this situation, retrying is normally the correct thing to do.
>>>
>>> You may have come across a bug where a *particular* piece of data being
>>> sent by the agent was causing a *particular* version of prometheus to
>>> fail with a 5xx internal error every time. The logs should make it
>>> clear if this was happening.
>>>
>>> On Friday 17 May 2024 at 10:02:49 UTC+1 koly li wrote:
>>>
>>>> Hello all,
>>>>
>>>> Recently we found that all of our samples were lost. After some
>>>> investigation, we found:
>>>> 1) We are using the Prometheus agent to send all data to the
>>>> Prometheus server via remote write.
>>>> 2) The agent's sample-sending code is in
>>>> storage/remote/queue_manager.go, in the function
>>>> sendWriteRequestWithBackoff().
>>>> 3) Inside that function, if attempt() (the function that makes the
>>>> request to the Prometheus server) returns a recoverable error, it
>>>> retries sending the request.
>>>> 4) When is a recoverable error returned? One scenario is when the
>>>> Prometheus server returns a 5xx error.
>>>> 5) I think not every 5xx error is recoverable, and there is no other
>>>> way to exit the for loop in sendWriteRequestWithBackoff().
>>>> The agent keeps retrying, but every time it receives a 5xx from the
>>>> server, so we lost all samples for hours until we restarted the agent.
>>>>
>>>> So my question is: why is a 5xx from the Prometheus server considered
>>>> recoverable? And I believe there should be a way to exit the loop, for
>>>> example a maximum number of retries.
>>>>
>>>> It seems that the agent mode is not mature enough to work in
>>>> production.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/5661de1d-21c5-486c-9177-fa346ebdc922n%40googlegroups.com.

