Just to throw my 2c in, we've been battling this problem at (company) as we
move more of our customer-facing services to a serverless model, chiefly the
issue of metrics aggregation for services that can't easily track their own
state across multiple requests. For us, there are just too many aggregation
semantics that can't be expressed in Prometheus types, so we have resorted
to hacks such as https://github.com/sinkingpoint/gravel-gateway to express
them. The wider variety of OpenMetrics types solves most of these issues,
but that requires push gateway support as above, and a non-zero effort from
clients to migrate to OpenMetrics client libs (if those even exist for
their languages of choice).

For the questions above, _we_ answer in the following way:

> What tradeoff would it make when metric ingestion is slower than metric
production? Backpressure or drop data?

Just drop it, with metrics to indicate that we did so.
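To make that concrete, here's a rough Python sketch (names are illustrative,
not our actual code) of a receiver that sheds load instead of applying
backpressure, and counts what it drops so the loss is at least visible:

```python
from collections import deque


class PushReceiver:
    """Toy ingestion path that sheds load rather than blocking the pusher."""

    def __init__(self, max_pending=1000):
        self.pending = deque()
        self.max_pending = max_pending
        # Self-observability: how many pushes we dropped, exposed as a
        # regular counter alongside everything else.
        self.dropped_total = 0

    def push(self, sample):
        if len(self.pending) >= self.max_pending:
            self.dropped_total += 1  # drop, never block the client
            return False
        self.pending.append(sample)
        return True
```

The important property is that `push` always returns immediately; a slow
consumer shows up as a rising drop counter, not as latency in the pusher.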

> What are the semantics of pushing a counter?

Aggregate by summing by default, with other options available and
configurable by the client.
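Roughly, the default behaviour looks like this sketch (illustrative names,
not gravel-gateway's actual implementation), with the client able to opt
into a different strategy per push:

```python
from collections import defaultdict


class CounterAggregator:
    """Aggregates pushed counter values by summing per label set (the default).

    Only 'sum' and 'replace' are sketched here; a real gateway would offer
    more strategies.
    """

    def __init__(self):
        self.values = defaultdict(float)

    def ingest(self, name, labels, value, mode="sum"):
        # Key on the metric name plus a canonicalised label set.
        key = (name, tuple(sorted(labels.items())))
        if mode == "sum":
            self.values[key] += value  # default: pushes add up
        elif mode == "replace":
            self.values[key] = value   # client opted out of summing
        else:
            raise ValueError(f"unknown aggregation mode: {mode}")
```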

> Where would the data move from there, and how?

Exposed, as with the Pushgateway, as a regular Prometheus scrape.
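That is, the receiver just renders its aggregated state in the Prometheus
text exposition format on /metrics. A minimal sketch (illustrative, and
omitting HELP/TYPE lines and label-value escaping):

```python
def render_exposition(values):
    """Render aggregated samples as Prometheus text exposition, so the
    receiver can be scraped like any other target.

    `values` maps (name, labels-tuple) -> float, as produced by the
    aggregation step above.
    """
    lines = []
    for (name, labels), value in sorted(values.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```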

> How many of these receivers would you typically run? How much
coordination is necessary between them?

This gets complicated. In our setup we have a DaemonSet in k8s and an
ingress that does consistent hashing on the service name, so that any given
service is always routed to the same two instances.
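The routing idea can be sketched with rendezvous hashing (our ingress does
the equivalent with consistent hashing; the names here are illustrative):

```python
import hashlib


def pick_instances(service, instances, replicas=2):
    """Rendezvous (highest-random-weight) hashing: deterministically maps a
    service name to the same `replicas` instances, so every push for a given
    service lands on a stable pair no matter which ingress handled it."""

    def weight(instance):
        # Stable per-(service, instance) score derived from a hash.
        h = hashlib.sha256(f"{service}/{instance}".encode()).hexdigest()
        return int(h, 16)

    return sorted(instances, key=weight, reverse=True)[:replicas]
```

Because the score depends only on the (service, instance) pair, adding or
removing one instance only moves the services that hashed to it.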

We've run this setup in production for about a year and a half now, and it
works for us in practice, although it's definitely not ideal. We'd welcome
some sort of official OpenMetrics solution.

- Colin


On Sun, Nov 28, 2021 at 10:22 AM Matthias Rampke <[email protected]>
wrote:

> What properties would an ideal OpenMetrics push receiver have? In
> particular, I am wondering:
>
> - What tradeoff would it make when metric ingestion is slower than metric
> production? Backpressure or drop data?
> - What are the semantics of pushing a counter?
> - Where would the data move from there, and how?
> - How many of these receivers would you typically run? How much
> coordination is necessary between them?
>
> From observing the use of the statsd exporter, I see a few cases where it
> covers ground that is not very compatible with the in-process aggregation
> implied by the pull model. It has the downside of mapping through a
> different metrics model, and its tradeoffs are informed by the ones statsd
> made 10+ years ago. I wonder what it would look like, remade in 2022
> starting from OpenMetrics.
>
>
> /MR
>
> On Sat, 27 Nov 2021, 12:50 Rob Skillington, <[email protected]>
> wrote:
>
>> Here’s the documentation for using M3 Coordinator (with or without M3
>> Aggregator) with a backend that has a Prometheus Remote Write receiver:
>> https://m3db.io/docs/how_to/any_remote_storage/
>>
>> Would be more than happy to do a call some time on this topic. The more
>> we’ve looked at this, the more it looks primarily like a client library
>> issue, well before you consider the backend/receiver aspect (there are
>> options out there, and they are fairly mechanical to overcome). The client
>> library concerns have a lot of ergonomic and practical issues, especially
>> in a serverless environment where you may need to wait for publishing to
>> complete before finishing your request. Perhaps an async approach is
>> ideal: publish a message to a local serverless message queue like SQS, and
>> have a reader consume that and use another client library to push the data
>> out. That would be more type safe and probably less lossy than writing
>> logs and then reading and publishing them, but it would need good client
>> library support for both the serverless producers and the readers/pushers.
>>
>> Rob
>>
>> On Sat, Nov 27, 2021 at 1:41 AM Rob Skillington <[email protected]>
>> wrote:
>>
>>> FWIW we have been experimenting with users pushing OpenMetrics protobuf
>>> payloads quite successfully, but only sophisticated exporters that can
>>> guarantee no collisions of time series, generate their own monotonic
>>> counters, etc. are using this at this time.
>>>
>>> If you're looking for a solution that also involves aggregation support,
>>> M3 Coordinator (either standalone or combined with M3 Aggregator) supports
>>> Remote Write as a backend (and is thus compatible with Thanos, Cortex and
>>> of course Prometheus itself too due to the PRW receiver).
>>>
>>> M3 Coordinator, however, does not have any nice support for publishing
>>> to it from a serverless environment (since the primary protocol it
>>> supports is Prometheus Remote Write, which I would assume has no metrics
>>> clients, etc.).
>>>
>>> Rob
>>>
>>>
>>> On Mon, Nov 15, 2021 at 9:54 PM Bartłomiej Płotka <[email protected]>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I would love to resurrect this thread. I think we are missing a good
>>>> push-gateway-like product that would ideally live in Prometheus
>>>> (repo/binary, or could be recommended by us) and convert events to
>>>> metrics in a cheap way, because that is what this amounts to when we talk
>>>> about short-lived containers and serverless functions. What's the latest,
>>>> Rob? I would be interested in some call for this if that is still on the
>>>> table. (:
>>>>
>>>> I think we have some new options on the table, like supporting OTel
>>>> metrics as such a potential high-cardinality event push, given there are
>>>> more and more clients for that API. Potentially the OTel Collector can
>>>> work as such a "push gateway" proxy, but at this point it's extremely
>>>> generic, so we might want to consider something more
>>>> focused/efficient/easier to maintain. Let's see (: The other problem is
>>>> that OTel metrics is yet another protocol. Users might want to use the
>>>> push gateway API, remote write, or logs/traces as per @Tobias Schmidt
>>>> <[email protected]>'s idea:
>>>>
>>>>> Another service "loggateway" (or otherwise named) would then stream
>>>>> the logs, aggregate them and either expose them on the common /metrics
>>>>> endpoint or push them with remote write right away to a Prometheus
>>>>> instance hosted somewhere (like Grafana Cloud).
>>>>
>>>>
>>>> Kind Regards,
>>>> Bartek Płotka (@bwplotka)
>>>>
>>>>
>>>> On Fri, Jun 25, 2021 at 6:11 AM Rob Skillington <[email protected]>
>>>> wrote:
>>>>
>>>>> With respect to OpenMetrics push, we had something very similar at
>>>>> $prevco: short-running tasks (Jenkins, Flink jobs, etc.) pushed a
>>>>> payload that looked very much like the OpenMetrics protobuf payload,
>>>>> but was a Thrift snapshot of an aggregated set of in-process metrics.
>>>>>
>>>>> I definitely agree it’s not ideal, and ideally the platform provider
>>>>> would supply a collection point. (There is a plug-in for Jenkins that
>>>>> can do this, but custom metrics are very hard / nigh impossible to make
>>>>> work with it; and that is a non-cloud-provider environment where this is
>>>>> actually possible to make work, just no one has made it seamless.)
>>>>>
>>>>> I agree with Richi that something that could push to a Prometheus
>>>>> Agent-like target that supports OpenMetrics push could be a good middle
>>>>> ground, with the right support / guidelines:
>>>>> - A way to specify multiple Prometheus Agent targets and quickly fail
>>>>> over from one to another if one does not respond within $X ms (you
>>>>> could imagine a 5ms budget for each, with at most 3 tried, introducing
>>>>> at worst 15ms overhead when all are down in 3 local availability zones,
>>>>> but in general this is a disaster case)
>>>>> - Deduplication, so that a retried push is not double counted; this
>>>>> might mean timestamping the metrics… (so if written twice, only the
>>>>> first record is kept, etc.)
>>>>>
>>>>> I think, like the Pushgateway, it should generally be a last-resort
>>>>> kind of option, with clear limitations, so that pull still remains the
>>>>> clear choice for anything but these environments.
>>>>>
>>>>> Is there any interest discussing this on a call some time?
>>>>>
>>>>> Rob
>>>>>
>>>>> On Thu, Jun 24, 2021 at 5:09 PM Bjoern Rabenstein <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> On 22.06.21 11:26, Tobias Schmidt wrote:
>>>>>> >
>>>>>> > Last night I was wondering if there are any other common interfaces
>>>>>> > available in serverless environments and noticed that all products by
>>>>>> > AWS (Lambda) and GCP (Functions, Run) at least provide the option to
>>>>>> > handle log streams, sometimes even log files on disk. I'm currently
>>>>>> > thinking about experimenting with an approach where containers log
>>>>>> > metrics to stdout / some file, get picked up by the serverless
>>>>>> > runtime and written to some log stream. Another service "loggateway"
>>>>>> > (or otherwise named) would then stream the logs, aggregate them and
>>>>>> > either expose them on the common /metrics endpoint or push them with
>>>>>> > remote write right away to a Prometheus instance hosted somewhere
>>>>>> > (like Grafana Cloud).
>>>>>>
>>>>>> Perhaps I'm missing something, but isn't that
>>>>>> https://github.com/google/mtail ?
>>>>>>
>>>>>> --
>>>>>> Björn Rabenstein
>>>>>> [PGP-ID] 0x851C3DA17D748D03
>>>>>> [email] [email protected]
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "Prometheus Developers" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/prometheus-developers/20210624210908.GB11559%40jahnn
>>>>>> .
>>>>>>
>>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Prometheus Developers" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/prometheus-developers/CABakzZaGy-Rm1qv5%3D6-2ghjmDyW3k1YkO12YfWurHZmzfsv4-g%40mail.gmail.com
>>> <https://groups.google.com/d/msgid/prometheus-developers/CABakzZaGy-Rm1qv5%3D6-2ghjmDyW3k1YkO12YfWurHZmzfsv4-g%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Prometheus Developers" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-developers/CAFtK1UOa5ORJyui5-ORACtCMgS-82ZGz4G1T90EV6WY_RPDpqQ%40mail.gmail.com
>> <https://groups.google.com/d/msgid/prometheus-developers/CAFtK1UOa5ORJyui5-ORACtCMgS-82ZGz4G1T90EV6WY_RPDpqQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
>> .
>>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-developers/CAMV%3D_gb0ZYLNs%2B%2BYx9LSc885%3DivHMno7DPA3eEvjifgnD5Lx%3DQ%40mail.gmail.com
> <https://groups.google.com/d/msgid/prometheus-developers/CAMV%3D_gb0ZYLNs%2B%2BYx9LSc885%3DivHMno7DPA3eEvjifgnD5Lx%3DQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>
