Thanks a lot for the feedback so far! It's not a forgotten topic. We are actively gathering feedback from different projects/teams, and input from the Knative project is really valuable. There will also be two talks about monitoring short-lived jobs at the next KubeCon EU:
* Operating Prometheus in a Serverless World <https://sched.co/yto1> - Colin Douch, Cloudflare
* Fleeting Metrics: Monitoring Short-lived or Serverless Jobs with Prometheus <https://sched.co/zfKj> - Bartłomiej Płotka & Saswata Mukherjee, Red Hat

We are working with Saswata and Colin on making sure we don't miss any requirements, so we can explain the current situation and propose a way forward.

FYI: We are meeting tomorrow with the OpenFaaS community to learn from them too: https://twitter.com/openfaas/status/1511266154005807107 if you want to join! 🤗

Kind Regards,
Bartek

On Wednesday, January 19, 2022 at 1:55:57 PM UTC [email protected] wrote:

> Hi all!
>
> Hope I'm not too late for the discussion. I would like to revive it, as I find it really useful for Knative and any serverless framework. As a Knative contributor, also working on the monitoring side of the project, here is my PoV:
>
> a) OpenFaaS, as an example (mentioned earlier above), might not be the best to consider, as it seems to provide metrics only at the ingress side (Gateway), similarly to what you get from a service mesh like Istio when you monitor its ingress. I don't see any option to collect user metrics, at least out of the box. Another serverless system, Dapr, has (wrt tracing) a sidecar that, among other things, pushes traces to the OTel Collector (https://docs.dapr.io/operations/monitoring/tracing/open-telemetry-collector). Although Dapr still uses a pull model for metrics, this highlights the path they are taking. Knative, btw, supports different exporters, so it can use either a pull model or a push model. It is not restricted to OpenTelemetry at all.
>
> b) What is the targeted latency for serverless? In cloud environments it is possible to get invocation latency down to milliseconds (https://aws.amazon.com/blogs/compute/creating-low-latency-high-volume-apis-with-provisioned-concurrency) for simple funcs, and also to minimize cold start issues.
> As a rule, any solution that ships metrics should take far less time than the func run itself and should not add considerable resource overhead. Also, depending on the cost model, users should not pay for that overhead, and you need to be able to distinguish it somehow, at least. Regarding latency, some apps can tolerate seconds or even minutes of latency, so it depends on how people want to ship metrics given their scenario. Btw, as background info, Knative cold start time is a few seconds (https://groups.google.com/g/knative-users/c/vqkP95ibq60).
>
> c) There is a question of whether the serverless runtime should provide metrics forwarding/collection. I would say it is possible for at least the end-to-end traffic metrics. This covers metrics related to requests entering the system, e.g. at ingress, where usually each request corresponds to a function invocation (Knative has this 1-1 mapping). Ingress seems the right point for robustness reasons: a request may fail at different stages, and this is also true for Knative, where different components may be on the request path. For any other metric, including user metrics, I would say a different, localized approach to gathering metrics seems preferable. Separation of concerns is one reason behind this, as we don't want centralized components to become a metric sink like a collector while also doing other things like scaling apps.
>
> Looking at a possible generic solution, I would guess it would be based on a local agent. Afaik a local TCP connection is at that ms scale, including the time for sending a few KBs of metrics data. Of course, this is not the only option; metrics could be written to some local file and its contents then streamed (the log solution mentioned above). Ideally, an architecture that ships metrics locally to some agent on the node would roughly satisfy the reqs (which, btw, should be captured in detail).
> That agent would then be able to push metrics to a metrics collector, either via remote write if it is Prometheus-based, or via some other way if it is the OTel node agent (https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/design.md#running-as-an-agent), etc. This is already done elsewhere, for example on AWS (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-open-telemetry.html).
>
> Best,
> Stavros
>
> On Sunday, November 28, 2021 at 1:42:38 AM UTC+2 [email protected] wrote:
>
>> Just to throw my 2c in: we've been battling with this problem at (company) as we move more services to a serverless model for our customer-facing things, chiefly the issue of metrics aggregation for services that can't easily track their own state across multiple requests. For us, there are just too many metric semantics for different aggregations than can be expressed in Prometheus types, so we have resorted to hacks such as https://github.com/sinkingpoint/gravel-gateway to be able to express these. The wider variety of OpenMetrics types solves most of these issues, but that requires push gateway support as above, and a non-zero effort from clients to migrate to OpenMetrics client libs (if those even exist for their languages of choice).
>>
>> For the above, _we_ answer the questions in the following way:
>>
>> > What tradeoff would it make when metric ingestion is slower than metric production? Backpressure or drop data?
>>
>> Just drop it, with metrics to indicate as such.
>>
>> > What are the semantics of pushing a counter?
>>
>> Aggregation by summing by default, with different options available, configurable by the client.
>>
>> > Where would the data move from there, and how?
>>
>> Exposed, as per the push gateway, as a regular Prometheus scrape.
>>
>> > How many of these receivers would you typically run? How much coordination is necessary between them?
>>
>> This gets complicated.
>> In our setup we have a daemonset in k8s and an ingress that does consistent hashing on the service name, so that any given service is routed to two different instances.
>>
>> Having run this setup in production for about a year and a half now, it works for us in practice, although it's definitely not ideal. We'd welcome some sort of official OpenMetrics solution.
>>
>> - Colin
>>
>> On Sun, Nov 28, 2021 at 10:22 AM Matthias Rampke <[email protected]> wrote:
>>
>>> What properties would an ideal OpenMetrics push receiver have? In particular, I am wondering:
>>>
>>> - What tradeoff would it make when metric ingestion is slower than metric production? Backpressure or drop data?
>>> - What are the semantics of pushing a counter?
>>> - Where would the data move from there, and how?
>>> - How many of these receivers would you typically run? How much coordination is necessary between them?
>>>
>>> From observing the use of the statsd exporter, I see a few cases where it covers ground that is not very compatible with the in-process aggregation implied by the pull model. It has the downside of mapping through a different metrics model, and its tradeoffs are informed by the ones statsd made 10+ years ago. I wonder what it would look like, remade in 2022 starting from OpenMetrics.
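Combining Matthias's questions with Colin's answers above, here is a minimal, purely illustrative sketch of such a receiver's default counter semantics (pushed increments are summed per series and the totals are exposed for a regular Prometheus scrape; all class and function names are hypothetical, not any existing product's API):

```python
# Hypothetical push-receiver counter aggregation: each short-lived job pushes
# its counter increments; the receiver sums them per series and renders the
# totals in Prometheus text exposition format for a normal pull/scrape.
from collections import defaultdict


class PushReceiver:
    def __init__(self):
        # (metric name, sorted label pairs) -> running total
        self.counters = defaultdict(float)

    def push(self, name, labels, value):
        """Aggregate a pushed counter increment by summing (the default policy)."""
        key = (name, tuple(sorted(labels.items())))
        self.counters[key] += value

    def render(self):
        """Render the aggregated state in Prometheus text exposition format."""
        lines = []
        for (name, labels), total in sorted(self.counters.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {total}")
        return "\n".join(lines)


# Two invocations of the same function, each pushing its own request count:
r = PushReceiver()
r.push("requests_total", {"func": "hello"}, 3)
r.push("requests_total", {"func": "hello"}, 2)
print(r.render())  # requests_total{func="hello"} 5.0
```

A real receiver would also have to handle the other questions (drop vs. backpressure, staleness, series expiry); this only shows the sum-on-push semantics Colin describes.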
>>> /MR
>>>
>>> On Sat, 27 Nov 2021, 12:50 Rob Skillington, <[email protected]> wrote:
>>>
>>>> Here's the documentation for using M3 Coordinator (with or without M3 Aggregator) with a backend that has a Prometheus Remote Write receiver: https://m3db.io/docs/how_to/any_remote_storage/
>>>>
>>>> Would be more than happy to do a call some time on this topic. The more we've looked at this, the clearer it is that this is primarily a client library issue, way before you consider the backend/receiver aspect. For the latter there are options out there, and they are fairly mechanical to overcome, vs. the client library concerns, which have a lot of ergonomic and practical issues, especially in a serverless environment where you may need to wait for publishing before finishing your request. Perhaps an async process is ideal: publishing a message to a local serverless message queue like SQS, and having a reader consume that and use another client library to push the data out. It would be more type-safe and probably less lossy than writing logs, reading them back, and then publishing, but it would need good client library support for both the serverless producers and the readers/pushers.
>>>>
>>>> Rob
>>>>
>>>> On Sat, Nov 27, 2021 at 1:41 AM Rob Skillington <[email protected]> wrote:
>>>>
>>>>> FWIW, we have been experimenting with users pushing OpenMetrics protobuf payloads quite successfully, but only sophisticated exporters that can guarantee no collisions of time series, generate their own monotonic counters, etc. are using this at this time.
>>>>>
>>>>> If you're looking for a solution that also involves aggregation support, M3 Coordinator (either standalone or combined with M3 Aggregator) supports Remote Write as a backend (and is thus compatible with Thanos, Cortex, and of course Prometheus itself too, due to the PRW receiver).
>>>>> M3 Coordinator, however, does not have any nice support for publishing to it from a serverless environment (since the primary protocol it supports is Prometheus Remote Write, which has no metrics clients, etc., I would assume).
>>>>>
>>>>> Rob
>>>>>
>>>>> On Mon, Nov 15, 2021 at 9:54 PM Bartłomiej Płotka <[email protected]> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I would love to resurrect this thread. I think we are missing a good push-gateway-like product that would ideally live in Prometheus (repo/binary, or could be recommended by us) and convert events to metrics in a cheap way, because that is what this is about when we talk about short-lived containers and serverless functions. What's the latest, Rob? I would be interested in some call for this if that is still on the table. (:
>>>>>>
>>>>>> I think we have some new options on the table, like supporting OTel metrics as such a potential high-cardinality event push, given there are more and more clients for that API. Potentially the OTel Collector can work as such a "push gateway" proxy, but at this point it's extremely generic, so we might want to consider something more focused/efficient/easier to maintain. Let's see (: The other problem is that OTel metrics is yet another protocol. Users might want to use the push gateway API, remote write, or logs/traces as per @Tobias Schmidt's idea:
>>>>>>
>>>>>>> Another service "loggateway" (or otherwise named) would then stream the logs, aggregate them and either expose them on the common /metrics endpoint or push them with remote write right away to a Prometheus instance hosted somewhere (like Grafana Cloud).
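The "loggateway" idea quoted above can be sketched very roughly as follows. This is a hedged illustration only (the JSON line format, field names, and function name are all made up for the example, not part of any proposal); the point is just that a reader streaming the runtime's log output can recover metric samples and aggregate them, mtail-style:

```python
# Illustrative "loggateway" core: functions emit one JSON metric sample per
# log line to stdout; a separate reader streams the log, keeps the counter
# totals, and would expose them on /metrics or forward them via remote write.
import json
from collections import defaultdict


def aggregate_metric_logs(lines):
    """Sum counter samples parsed from JSON log lines, skipping ordinary logs."""
    totals = defaultdict(float)
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # plain application log output, not a metric sample
        if record.get("type") == "counter":
            totals[record["name"]] += record["value"]
    return dict(totals)


# A mixed log stream: two metric samples interleaved with a normal log line.
log_stream = [
    '{"type": "counter", "name": "invocations_total", "value": 1}',
    "starting handler for request abc123",
    '{"type": "counter", "name": "invocations_total", "value": 1}',
]
print(aggregate_metric_logs(log_stream))  # {'invocations_total': 2.0}
```

This is essentially what mtail does for arbitrary log formats, as Björn points out later in the thread; a dedicated format just makes the parsing trivial.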
>>>>>> Kind Regards,
>>>>>> Bartek Płotka (@bwplotka)
>>>>>>
>>>>>> On Fri, Jun 25, 2021 at 6:11 AM Rob Skillington <[email protected]> wrote:
>>>>>>
>>>>>>> With respect to OpenMetrics push, we had something very similar at $prevco that pushed something that looked very much like the protobuf payload of OpenMetrics (but was a Thrift snapshot of an aggregated set of in-process metrics) and was used by short-running tasks (Jenkins, Flink jobs, etc.).
>>>>>>>
>>>>>>> I definitely agree it's not ideal, and ideally the platform provider can supply a collection point (there is a Jenkins plug-in that can do this, but custom metrics are very hard / nigh impossible to make work with it, and that is a non-cloud-provider environment that's actually possible to make work; just no one has made it seamless).
>>>>>>>
>>>>>>> I agree with Richi that something that could push to a Prometheus Agent-like target supporting OpenMetrics push could be a good middle ground, with the right support / guidelines:
>>>>>>> - A way to specify multiple Prometheus Agent targets and quickly fail over from one to another if one does not respond within $X ms (you could imagine a 5ms budget for each, with max 3 tried, introducing at worst 15ms overhead when all are down in 3 local availability zones, but in general this is a disaster case)
>>>>>>> - Deduplication, so that a retried push is not double counted; this might mean timestamping the metrics (so if a sample is written twice, only the first record is kept, etc.)
>>>>>>>
>>>>>>> I think it should, like the Push Gateway, generally be a last-resort kind of option with clear limitations, so that pull still remains the clear choice for anything but these environments.
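Rob's two guidelines above can be sketched in a few lines. This is a hedged, self-contained illustration (the function and class names, and the `send` callback signature, are invented for the example, not a real client library): try each agent target with a small per-attempt budget so failover stays within the stated worst case, and deduplicate retried pushes by keeping only the first write per (series, timestamp):

```python
# Sketch of failover-with-budget plus first-write-wins deduplication.


def push_with_failover(targets, payload, send, budget_ms=5, max_attempts=3):
    """Try up to max_attempts targets, giving each budget_ms before moving on.

    With 3 targets and a 5ms budget this bounds the overhead at 15ms even
    when every target is down (the disaster case Rob describes).
    """
    for target in targets[:max_attempts]:
        if send(target, payload, timeout_ms=budget_ms):
            return target  # first responsive target wins
    return None  # all attempts exhausted; caller drops or queues the payload


class DedupStore:
    """Keep only the first write per (series, timestamp) so retries don't double count."""

    def __init__(self):
        self.samples = {}

    def write(self, series, timestamp_ms, value):
        # setdefault ignores the write if that (series, timestamp) already exists.
        self.samples.setdefault((series, timestamp_ms), value)


store = DedupStore()
store.write("requests_total", 1700000000000, 5)
store.write("requests_total", 1700000000000, 5)  # retried push, ignored
print(store.samples)  # {('requests_total', 1700000000000): 5}
```

The dedup side only works if clients timestamp samples consistently across retries, which is exactly the caveat Rob raises ("this might mean timestamping the metrics").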
>>>>>>> Is there any interest in discussing this on a call some time?
>>>>>>>
>>>>>>> Rob
>>>>>>>
>>>>>>> On Thu, Jun 24, 2021 at 5:09 PM Bjoern Rabenstein <[email protected]> wrote:
>>>>>>>
>>>>>>>> On 22.06.21 11:26, Tobias Schmidt wrote:
>>>>>>>> > Last night I was wondering if there are any other common interfaces
>>>>>>>> > available in serverless environments and noticed that all products by AWS
>>>>>>>> > (Lambda) and GCP (Functions, Run) at least provide the option to handle log
>>>>>>>> > streams, sometimes even log files on disk. I'm currently thinking about
>>>>>>>> > experimenting with an approach where containers log metrics to stdout /
>>>>>>>> > some file, get picked up by the serverless runtime and written to some log
>>>>>>>> > stream. Another service "loggateway" (or otherwise named) would then stream
>>>>>>>> > the logs, aggregate them and either expose them on the common /metrics
>>>>>>>> > endpoint or push them with remote write right away to a Prometheus instance
>>>>>>>> > hosted somewhere (like Grafana Cloud).
>>>>>>>>
>>>>>>>> Perhaps I'm missing something, but isn't that https://github.com/google/mtail ?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Björn Rabenstein
>>>>>>>> [PGP-ID] 0x851C3DA17D748D03
>>>>>>>> [email] [email protected]
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/20210624210908.GB11559%40jahnn.

