So, I want to be really clear that I'm saying this with a "using Prometheus at my job" hat on rather than a Prometheus team one.
I recently came to a similar conclusion: the Pushgateway isn't practical to use for anything other than pushing a last completion time for a given job (or instance of a job). I have a handful of cron jobs that run in the background in kube clusters, perform some work, and push some metrics to say they've done that. Right now they only push a gauge to say that they successfully ran at the current time. I would love for them to push some counters about how many items they processed. The series of unworkable approaches I ran through was:

- Use Pushgateway without a pod label: counter values get replaced rather than accumulated, and if you typically process tiny numbers of events (like low single digits), the replaced values can confuse restart detection in rate functions.
- Use Pushgateway with a pod label: the unbounded memory growth problem you discussed, plus unwanted timeseries churn in Prometheus itself.
- Use prom-aggregation-gateway <https://github.com/zapier/prom-aggregation-gateway>: it sums gauges as well as counters <https://github.com/zapier/prom-aggregation-gateway/issues/65>, which makes no sense when you're pushing a timestamp gauge.

I may well have missed something while trying to find a solution, and I'd be happy to hear that I have. For the sake of doing something relatively quick at work, I'm going to be contributing a Prometheus push source to Vector <https://github.com/vectordotdev/vector/issues/10304#issuecomment-1516858846> soon, which will let me sum counters and replace gauges. This approach is still subject to the usual caveats of cramming push metrics into Prometheus's pull model, but for this style of job it honestly feels like the right compromise.

With the team hat back on, I'd be curious whether there's any appetite for a first-party solution that looks something like this. The Vector solution will get the job done, but I do feel a bit silly bringing in an any-to-any logs and metrics processor only to use features that live entirely within the Prometheus ecosystem.
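For concreteness, here is roughly what the "safe" use case above looks like with the official Python client: push only a last-success timestamp gauge and let each run replace the previous one. This is a sketch; the gateway address, job name, and metric name are my own placeholders, not anything from this thread.

```python
# Sketch: a cron job pushing only a "last success" timestamp gauge to
# the Pushgateway. Gateway address / job name / metric name are
# illustrative assumptions.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "cronjob_last_success_timestamp_seconds",
    "Unix time of the last successful run of this cron job",
    registry=registry,
)


def record_success(gateway="pushgateway.monitoring.svc:9091", job="my-cronjob"):
    """Set the gauge to now and push it, replacing the job's group."""
    last_success.set_to_current_time()
    # push_to_gateway issues a PUT, which replaces all metrics in the
    # grouping key -- so repeated runs overwrite one series instead of
    # accumulating new ones. That replacement semantics is exactly why
    # this works for a timestamp gauge and not for counters.
    push_to_gateway(gateway, job=job, registry=registry)
```

Alerting on staleness then reduces to comparing `time()` against that gauge in a Prometheus rule, with no TTL needed on the gateway side.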
On Thursday, 15 June 2023 at 05:20:24 UTC+1 Braden Schaeffer wrote:

> I wanted to reopen this discussion because I am having a very difficult
> time understanding how the Pushgateway can be the suggested solution for
> batch job metric collection, yet simultaneously batch jobs are not a great
> use case example for why metric TTLs are needed in the Pushgateway.
>
> The most basic example: two batch jobs that produce the same metrics (gRPC
> or HTTP metrics). This is not just `last_completed_at` or something, as I
> have seen before, where it's the same metric being updated over and over
> again. You have to include a label that identifies these jobs as different
> so that metrics like gRPC request rates can be calculated correctly. In the
> Kubernetes world this usually means pod ID. Simple enough, until you have
> 1000s of these pod IDs compounded by other labels.
>
> By now we all know those metrics are going to stay around forever, but I
> don't understand why the answer to this problem is "this is not a good
> use case". For the Pushgateway? For TTL? What am I doing wrong?
>
> I've got a pipeline and library code streamlined for Prometheus metric
> collection, and the only solution I have seen offered at all is "use
> statsd". No. That's silly. I'd need new clients and two ways of defining
> metrics in code to account for each potential storage solution. Two APIs,
> etc.
>
> Can someone please help me understand why the Pushgateway's existence is
> not reason enough to implement TTL?

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/907952d5-86af-45b9-b695-872141779cd4n%40googlegroups.com.

