> server returned HTTP status 500 Internal Server Error: too old sample

This is not the server failing to process the data; it's the client supplying invalid data. You found that this has since been fixed to return a 400.
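For anyone skimming the thread, that is the distinction at stake: errors caused by the payload itself should come back as a 4xx so the sender stops retrying, while genuine server-side failures stay 5xx so the sender retries later. A minimal, self-contained Go sketch of that classification; the sentinel errors and `statusForAppendError` are stand-ins of my own, not the actual Prometheus handler code:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Stand-in sentinel errors; the real handler inspects errors from the TSDB appender.
var (
	errTooOldSample   = errors.New("too old sample")
	errDuplicateLabel = errors.New("label name is not unique: invalid sample")
)

// statusForAppendError classifies an append error the way the thread argues it
// should be classified: bad client data -> 400 (do not retry), anything the
// server cannot explain -> 500 (client may retry later).
func statusForAppendError(err error) int {
	switch {
	case err == nil:
		return http.StatusOK // any 2xx tells the sender the batch was accepted
	case errors.Is(err, errTooOldSample), errors.Is(err, errDuplicateLabel):
		return http.StatusBadRequest // resending the same payload can never succeed
	default:
		return http.StatusInternalServerError // genuine server-side trouble; worth retrying
	}
}

func main() {
	for _, err := range []error{nil, errTooOldSample, errors.New("wal: disk full")} {
		fmt.Println(statusForAppendError(err), err)
	}
}
```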
> server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample

I can't speak for the authors, but it looks to me like that should be a 400 as well.

On Monday 20 May 2024 at 04:52:03 UTC+1 koly li wrote:

> Sorry for my poor description. Here is the story:
>
> 1) At first, we were using Prometheus v2.47.
>
> Then we found that all metrics were missing, so we checked the Prometheus log and the Prometheus agent log.
>
> Prometheus log (lots of lines):
>
> ts=2024-04-19T20:33:26.485Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:26.539Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:26.626Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:26.775Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:27.042Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-19T20:33:27.552Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ....
> ts=2024-04-22T03:00:03.327Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
> ts=2024-04-22T03:00:08.394Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="too old sample"
>
> Prometheus agent log:
>
> ts=2024-04-19T20:33:26.517Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-19T20:34:29.714Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-19T20:35:30.113Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-19T20:36:30.478Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ....
> ts=2024-04-22T02:56:57.281Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-22T02:57:57.624Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-22T02:58:57.943Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-22T02:59:58.267Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
> ts=2024-04-22T03:00:58.733Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too old sample"
>
> Then we checked the code:
> https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L77
>
> The "too old sample" error is treated as a 500. And the agent keeps retrying (it exits only when the error is not recoverable, and a 500 is considered recoverable):
> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/queue_manager.go#L1670
>
>> You may have come across a bug where a *particular* piece of data being sent by the agent was causing a *particular* version of prometheus to fail with a 5xx internal error every time. The logs should make it clear if this was happening.
>
> We guess there were one or more samples with too-old timestamps that caused the problem. One or more samples with a "too old" timestamp cause the Prometheus agent to retry forever (the agent receives a 500), which prevents new samples from being sent.
>
> Because there is no logging of the incorrect samples, we cannot figure out what those samples are. The server logs samples only for certain errors:
> https://github.com/prometheus/prometheus/blob/release-2.47/storage/remote/write_handler.go#L132
>
>> The fundamental issue here is, why should restarting the *agent* cause the prometheus *server* to stop returning 500 errors?
>
> We restarted the agent 1-2 days after the problem occurred. The new data did not contain too-old samples; that is why the 500 errors disappeared.
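That matches the code linked above: the send loop gives up only on errors it does not consider recoverable, and (per the thread) an HTTP 5xx is treated as recoverable. Here is a stripped-down sketch of that shape, with illustrative names of my own rather than the real sendWriteRequestWithBackoff, showing why one poison batch that always draws a 500 stalls everything queued behind it:

```go
package main

import (
	"fmt"
	"time"
)

// recoverableError marks failures worth retrying, e.g. an HTTP 5xx response.
type recoverableError struct{ error }

// sendWithBackoff keeps resending the *same* batch until it either succeeds or
// fails with a non-recoverable error. There is no attempt limit, so a server
// that answers 500 for this batch every time blocks the shard indefinitely.
func sendWithBackoff(send func() error, minBackoff, maxBackoff time.Duration) error {
	backoff := minBackoff
	for {
		err := send()
		if err == nil {
			return nil
		}
		if _, ok := err.(recoverableError); !ok {
			return err // e.g. a 4xx: give up on the batch and move on
		}
		fmt.Printf("retrying in %v: %v\n", backoff, err)
		time.Sleep(backoff)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	attempts := 0
	_ = sendWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return recoverableError{fmt.Errorf("server returned HTTP status 500 Internal Server Error")}
		}
		return nil // succeeds on the third attempt
	}, 30*time.Millisecond, 5*time.Second)
	fmt.Println("batch sent after", attempts, "attempts")
}
```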
> 2) Then we upgraded to v2.51.
>
> The new version returns 400 for "too old sample":
> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L72
>
> However, we encountered another 500.
>
> Prometheus agent log:
>
> ts=2024-05-11T08:42:01.235Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"prometheus\" is not unique: invalid sample"
> ts=2024-05-11T08:42:02.749Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"service\" is not unique: invalid sample"
> ts=2024-05-11T08:42:02.798Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"resourceType\" is not unique: invalid sample"
> ts=2024-05-11T08:42:02.851Z caller=dedupe.go:112 component=remote level=warn remote_name=prometheus-k8s-0 url=https://prometheus-k8s-0.monitoring:9091/api/v1/write msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: label name \"namespace\" is not unique: invalid sample"
>
> We modified the code to log the offending samples, and then got this Prometheus log:
>
> ts=2024-05-11T08:42:26.603Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"resourceId\" is not unique: invalid sample" series="{__name__=\"ovs_vswitchd_interface_resets_total\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-rbac-proxy\", endpoint=\"ovs-metrics\", hostname=\"20230428-wangbo-dev16\", if_name=\"veth99fa6555\", instance=\"10.253.58.238:9983\", job=\"net-monitor-vnet-ovs\", namespace=\"net-monitor\", pod=\"net-monitor-vnet-ovs-66bdz\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceId=\"port-naqoi5tmkg5lrt0ubw\", resourceId=\"blb-74se39mqa9k3\", resourceType=\"Port\", resourceType=\"BLB\", rs_ip=\"10.0.0.3\", service=\"net-monitor-vnet-ovs\", service=\"net-monitor-vnet-ovs\", subnet_Id=\"snet-ztojflwrnd08xf5idw\", vip=\"11.4.2.64\", vpc_Id=\"vpc-6ss1uz29ctpfv0eqbj\", vpcid=\"11.4.2.64\"}" timestamp=1715349156000
> ts=2024-05-11T08:42:26.603Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"resourceId\" is not unique: invalid sample"
> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"service\" is not unique: invalid sample" series="{__name__=\"rest_client_request_size_bytes_bucket\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", container=\"kube-scheduler\", endpoint=\"https\", host=\"127.0.0.1:6443\", instance=\"10.253.58.236:10259\", job=\"scheduler\", le=\"262144\", namespace=\"kube-scheduler\", pod=\"kube-scheduler-20230428-wangbo-dev14\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", resourceType=\"NETWORK-HOST\", service=\"scheduler\", service=\"net-monitor-vnet-ovs\", verb=\"POST\"}" timestamp=1715349164522
> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"service\" is not unique: invalid sample"
> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:134 level=error component=web msg="unknown error from remote write" err="label name \"prometheus_replica\" is not unique: invalid sample" series="{__name__=\"workqueue_work_duration_seconds_sum\", clusterName=\"clustertest150\", clusterRegion=\"region0\", clusterZone=\"zone1\", endpoint=\"https\", instance=\"21.100.10.52:8443\", job=\"metrics\", name=\"ResourceSyncController\", namespace=\"service-ca-operator\", pod=\"service-ca-operator-645cfdbfb6-rjr4z\", prometheus=\"monitoring/agent-0\", prometheus_replica=\"prometheus-agent-0-0\", prometheus_replica=\"prometheus-agent-0-0\", service=\"metrics\"}" timestamp=1715349271085
> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:76 level=error component=web msg="Error appending remote write" err="label name \"prometheus_replica\" is not unique: invalid sample"
>
> Currently we don't know why there are duplicated labels. But when the server encounters duplicated labels, it returns a 500. Then the agent keeps retrying, which means new samples cannot be handled.
>
> We set external_labels in the prometheus-agent config:
>
>   global:
>     evaluation_interval: 30s
>     scrape_interval: 5m
>     scrape_timeout: 1m
>     external_labels:
>       clusterName: clustertest150
>       clusterRegion: region0
>       clusterZone: zone1
>       prometheus: ccos-monitoring/agent-0
>       prometheus_replica: prometheus-agent-0-0
>     keep_dropped_targets: 1
>
> and the remote write config:
>
>   remote_write:
>   - url: https://prometheus-k8s-0.monitoring:9091/api/v1/write
>     remote_timeout: 30s
>     name: prometheus-k8s-0
>     write_relabel_configs:
>     - target_label: __tmp_cluster_id__
>       replacement: 713c30cb-81c3-411d-b4dc-0c775a0f9564
>       action: replace
>     - regex: __tmp_cluster_id__
>       action: labeldrop
>     bearer_token: XDFSDF...
>     tls_config:
>       insecure_skip_verify: true
>     queue_config:
>       capacity: 10000
>       min_shards: 1
>       max_shards: 500
>       max_samples_per_send: 2000
>       batch_send_deadline: 10s
>       min_backoff: 30ms
>       max_backoff: 5s
>       sample_age_limit: 5m
>
>> You are saying that you would prefer the agent to throw away data, rather than hold onto the data and try again later when it may succeed. In this situation, retrying is normally the correct thing to do.
>
> Yes, retrying is the normal solution. But there should be a maximum number of retries. We noticed that the Prometheus agent sets the retry number in a request header, but it seems the header is not used by the server.
>
> The agent sets the retry number in the request header here:
> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/client.go#L214
>
> Besides, if some samples in the same request are incorrect and others are correct, why doesn't the Prometheus server save the correct part and drop the wrong part? It is more complicated because retries have to be considered, but is it possible to save the partial data and return a 206 once the maximum number of retries is reached?
>
> And should the Prometheus server log the samples for all kinds of errors?
> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/write_handler.go#L133
On Friday, May 17, 2024 at 8:15:04 PM UTC+8 Brian Candler wrote:

>> It's difficult to make sense of what you're saying. Without seeing logs from both the agent and the server while this problem was occurring (e.g. `journalctl -eu prometheus`), it's hard to know what was really happening. Also, you need to say what exact versions of Prometheus and the agent were running.
>>
>> The fundamental issue here is, why should restarting the *agent* cause the prometheus *server* to stop returning 500 errors?
>>
>> > So my question is why 5xx from the prometheus server is considered Recoverable?
>>
>> It is by definition of the HTTP protocol:
>> https://datatracker.ietf.org/doc/html/rfc2616#section-10.5
>>
>> Actually it depends on exactly which 5xx error code you're talking about, but common 500 and 503 errors are generally transient, meaning there was a problem at the server and the request may succeed if tried again later. If the Prometheus server wanted to tell the client that the request was invalid and could never possibly succeed, then it would return a 4xx error.
>>
>> > And I believe there should be a way to exit the loop, for example a maximum number of retries.
>>
>> You are saying that you would prefer the agent to throw away data, rather than hold onto the data and try again later when it may succeed. In this situation, retrying is normally the correct thing to do.
>>
>> You may have come across a bug where a *particular* piece of data being sent by the agent was causing a *particular* version of prometheus to fail with a 5xx internal error every time. The logs should make it clear if this was happening.
>>
>> On Friday 17 May 2024 at 10:02:49 UTC+1 koly li wrote:
>>
>>> Hello all,
>>>
>>> Recently we found that all of our samples were lost. After some investigation, we found:
>>>
>>> 1. We are using the Prometheus agent to send all data to a Prometheus server via remote write.
>>> 2. The agent's sample-sending code is in storage/remote/queue_manager.go, in the function sendWriteRequestWithBackoff().
>>> 3. Inside that function, if attempt() (the function where the request is made to the Prometheus server) returns a recoverable error, it retries sending the request.
>>> 4. When is a recoverable error returned? One scenario is when the Prometheus server returns a 5xx error.
>>> 5. I think not every 5xx error is recoverable, and there is no other way to exit the for loop in sendWriteRequestWithBackoff(). The agent kept retrying but received a 5xx from the server every time, so we lost all samples for hours until we restarted the agent.
>>>
>>> So my question is: why is a 5xx from the Prometheus server considered Recoverable? And I believe there should be a way to exit the loop, for example a maximum number of retries.
>>>
>>> It seems that the agent mode is not mature enough to work in production.
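For completeness, the "maximum number of retries" idea from the original question would look roughly like the sketch below. To be clear, this illustrates the proposal being discussed, not current agent behaviour; capping attempts means deliberately dropping the batch once the cap is reached, which is exactly the data-loss trade-off debated above.

```go
package main

import (
	"fmt"
	"time"
)

// sendWithBoundedRetries is a sketch of the proposal in the question above:
// retry a failed batch at most maxAttempts times, then drop it so the queue
// behind it can drain. The dropped samples are lost for good.
func sendWithBoundedRetries(send func() error, maxAttempts int, backoff time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = send(); err == nil {
			return nil
		}
		fmt.Printf("attempt %d/%d failed: %v\n", attempt, maxAttempts, err)
		time.Sleep(backoff)
	}
	return fmt.Errorf("dropping batch after %d attempts: %w", maxAttempts, err)
}

func main() {
	err := sendWithBoundedRetries(func() error {
		return fmt.Errorf("server returned HTTP status 500 Internal Server Error")
	}, 3, 100*time.Millisecond)
	fmt.Println(err)
}
```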

