Hello,
One of the biggest challenges we have when trying to run Prometheus with a
constantly growing number of scraped services is keeping resource usage
under control.
This usually means memory usage.
Cardinality is often a huge problem, and we regularly end up with services
accidentally exposing risky labels. One silly mistake we see every now and
then is putting raw error messages into labels, which leads to time series
like {error="connection from $ip:$port to $ip:$port timed out"} and so on.
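A minimal sketch of a common mitigation for this: instead of using the raw
error message as a label value, map it to a small fixed set of classes so
cardinality stays bounded. The helper name and the matching rules here are
made up for illustration, not from any real service:

```go
package main

import (
	"fmt"
	"strings"
)

// errorClass maps an arbitrary error message to one of a small, fixed
// set of label values, so an error label can never explode into
// unbounded cardinality the way a raw message with IPs and ports would.
func errorClass(msg string) string {
	switch {
	case strings.Contains(msg, "timed out"):
		return "timeout"
	case strings.Contains(msg, "connection refused"):
		return "refused"
	default:
		return "other"
	}
}

func main() {
	// Each raw message below would be a distinct time series if used
	// as a label value directly; classified, they collapse into one.
	msgs := []string{
		"connection from 10.0.0.1:4242 to 10.0.0.2:9090 timed out",
		"connection from 10.0.0.3:4123 to 10.0.0.2:9090 timed out",
	}
	for _, m := range msgs {
		fmt.Println(errorClass(m))
	}
}
```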
We've tried a number of ways of dealing with this using vanilla Prometheus
features, but none of them really works well for us.
Obviously there is sample_limit, which one might use here, but its biggest
problem is that once you hit the sample_limit threshold you lose all
metrics, and that's just not acceptable for us.
If I have a service that exports 999 time series and it suddenly goes to
1001 (with sample_limit=1000), I really don't want to lose all metrics just
because of that, since losing all monitoring is a bigger problem than having
a few extra time series in Prometheus. It's just too risky.
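For reference, stock sample_limit is set per scrape job; the job name and
target below are made up:

```yaml
scrape_configs:
  - job_name: "my-service"   # hypothetical job name
    sample_limit: 1000       # stock behavior: exceeding this fails the whole scrape
    static_configs:
      - targets: ["my-service:9090"]
```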
We're currently running Prometheus with patches from:
https://github.com/prometheus/prometheus/pull/11124
This gives us two levels of protection:
- global HEAD limit - Prometheus is not allowed to have more than M time
series in TSDB
- per-scrape sample_limit - but patched so that if you exceed sample_limit,
Prometheus starts rejecting time series that aren't already in TSDB
This works well for us and gives us a system that:
- gives us reassurance that Prometheus won't start getting OOM killed
overnight
- service owners can add new metrics without fear that a typo will cost
them all metrics
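A rough sketch of the patched per-scrape behavior as I understand it
(heavily simplified; the real PR works inside TSDB's appender, and every
name here is made up for illustration):

```go
package main

import "fmt"

// admitter decides, per scrape, whether a sample may be appended.
// Series already present in storage are always accepted, so existing
// metrics keep flowing even when the limit is exceeded; only brand-new
// series are rejected.
type admitter struct {
	existing map[string]bool // series already in TSDB, keyed by label set
	limit    int             // per-scrape sample_limit
	seen     int             // samples accepted during this scrape
}

func (a *admitter) admit(series string) bool {
	if a.existing[series] {
		a.seen++
		return true // known series: never dropped
	}
	if a.seen >= a.limit {
		return false // over the limit: reject only new series
	}
	a.existing[series] = true
	a.seen++
	return true
}

func main() {
	a := &admitter{existing: map[string]bool{"up": true}, limit: 1}
	fmt.Println(a.admit("up"))         // existing series is accepted
	fmt.Println(a.admit("new_series")) // new series over the limit is rejected
}
```

The key difference from stock sample_limit is that a breach degrades the
scrape gracefully instead of failing it wholesale.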
But comments on that PR suggest that it's a highly controversial feature.
I wanted to probe this community to see what the overall feeling is and how
likely it is that vanilla Prometheus will gain something like this.
It's a small patch, so I'm happy to just maintain it for our internal
deployments, but it feels like a common problem to me, so a baked-in
solution would be great.
Lukasz
--
You received this message because you are subscribed to the Google Groups
"Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-developers/5ab29a58-e5a4-43c5-b662-4436db61f20an%40googlegroups.com.