Thanks a lot for the feedback so far! It's not a forgotten topic. We are actively gathering feedback from different projects/teams, and input from the Knative project is really valuable. There will also be two talks about monitoring short-lived jobs at the next KubeCon EU:
* Operating Prometheus in a Serverless World <https://sched.co/yto1> - Colin Douch, Cloudflare
* Fleeting Metrics: Monitoring Short-lived or Serverless Jobs with Prometheus <https://sched.co/zfKj> - Bartłomiej Płotka & Saswata Mukherjee, Red Hat

We are working with Saswata and Colin on making sure we don't miss any requirements, so we can explain the current situation and propose a way forward.

FYI: We are meeting tomorrow with the OpenFaaS community to learn from them too: https://twitter.com/openfaas/status/1511266154005807107 if you want to join! 🤗

Kind Regards,
Bartek

On Wednesday, January 19, 2022 at 1:55:57 PM UTC [email protected] wrote:

> Hi all!
>
> Hope I'm not too late for the discussion. I would like to revive it, as I find it really useful for Knative and any serverless framework. As a Knative contributor, also working on the monitoring side of the project, here is my PoV:
>
> a) OpenFaaS, as an example (mentioned earlier above), might not be the best to consider, as it seems to provide metrics only at the ingress side (Gateway), similarly to what you get from a service mesh like Istio when you monitor its ingress. I don't see any option to collect user metrics, at least out of the box. Another serverless system, Dapr, has (wrt tracing) a sidecar that, among other things, pushes traces to the OTel Collector (https://docs.dapr.io/operations/monitoring/tracing/open-telemetry-collector). Although Dapr still uses a pull model for metrics, this highlights the path they are taking. Knative, btw, supports different exporters, so it can use either a pull model or a push model. It is not restricted to OpenTelemetry at all.
>
> b) What is the targeted latency for serverless? In cloud environments it is possible to get invocation latency down to milliseconds (https://aws.amazon.com/blogs/compute/creating-low-latency-high-volume-apis-with-provisioned-concurrency) for simple funcs, and also to minimize cold start issues.
> As a rule, any solution that ships metrics should take far less time than the func run itself and should not add considerable resource overhead. Also, depending on the cost model, users should not pay for that overhead, and you need to be able to distinguish it somehow, at least. Regarding latency, some apps can tolerate seconds or even minutes of latency, so it depends on how people want to ship metrics given their scenario. Btw, as background info, Knative cold start time is a few seconds (https://groups.google.com/g/knative-users/c/vqkP95ibq60).
>
> c) There is a question of whether the serverless runtime should provide metrics forwarding/collection. I would say it is possible for at least the end-to-end traffic metrics. This covers metrics related to requests entering the system, e.g. at ingress, where usually each request corresponds to a function invocation (Knative has this 1-1 mapping). Ingress seems the right point for robustness reasons: a request may fail at different stages, and this is also true for Knative, where different components may be on the request path. For any other metric, including user metrics, I would say a different, localized approach to gathering metrics seems preferable. Separation of concerns is one reason behind this, as we don't want centralized components to become a metric sink like a collector while also doing other things like scaling apps.
>
> Looking at a possible generic solution, I would guess it would be based on a local agent. Afaik a local TCP connection is at that ms scale, including the time for sending a few KBs of metrics data. Of course, this is not the only option; metrics could be written to some local file and its contents then streamed (the log solution mentioned above). Ideally, an architecture that ships metrics locally to some agent on the node would roughly satisfy the reqs (which, btw, should be captured in detail).
> That agent would then be able to push metrics to a metrics collector, either via remote write if it is Prometheus-based, or via some other way if it is the OTel node agent (https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/design.md#running-as-an-agent), etc. This is already done elsewhere, for example on AWS (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-open-telemetry.html).
>
> Best,
> Stavros
>
> On Sunday, November 28, 2021 at 1:42:38 AM UTC+2 [email protected] wrote:
>
>> Just to throw my 2c in: we've been battling with this problem at (company) as we move more services to a serverless model for our customer-facing things, chiefly the issue of metrics aggregation for services that can't easily track their own state across multiple requests. For us, there are just too many metric semantics for different aggregations than can be expressed in Prometheus types, so we have resorted to hacks such as https://github.com/sinkingpoint/gravel-gateway to be able to express these. The wider variety of OpenMetrics types solves most of these issues, but that requires push gateway support as above, and a non-zero effort from clients to migrate to OpenMetrics client libs (if those even exist for their languages of choice).
>>
>> For the above, _we_ answer the questions in the following way:
>>
>> > What tradeoff would it make when metric ingestion is slower than metric production? Backpressure or drop data?
>>
>> Just drop it, with metrics to indicate as such.
>>
>> > What are the semantics of pushing a counter?
>>
>> Aggregation by summing by default, with different options available, configurable by the client.
>>
>> > Where would the data move from there, and how?
>>
>> Exposed, as per the push gateway, as a regular Prometheus scrape.
>>
>> > How many of these receivers would you typically run? How much coordination is necessary between them?
>>
>> This gets complicated.
>> In our setup we have a daemonset in k8s and an ingress that does consistent hashing on the service name, so that any given service is routed to two different instances.
>>
>> Having run this setup in production for about a year and a half now, it works for us in practice, although it's definitely not ideal. We'd welcome some sort of official OpenMetrics solution.
>>
>> - Colin
>>
>> On Sun, Nov 28, 2021 at 10:22 AM Matthias Rampke <[email protected]> wrote:
>>
>>> What properties would an ideal OpenMetrics push receiver have? In particular, I am wondering:
>>>
>>> - What tradeoff would it make when metric ingestion is slower than metric production? Backpressure or drop data?
>>> - What are the semantics of pushing a counter?
>>> - Where would the data move from there, and how?
>>> - How many of these receivers would you typically run? How much coordination is necessary between them?
>>>
>>> From observing the use of the statsd exporter, I see a few cases where it covers ground that is not very compatible with the in-process aggregation implied by the pull model. It has the downside of mapping through a different metrics model, and its tradeoffs are informed by the ones statsd made 10+ years ago. I wonder what it would look like, remade in 2022 starting from OpenMetrics.
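Combining Matthias's questions with Colin's answers above, here is a minimal, purely illustrative sketch of such a receiver's default counter semantics (pushed increments are summed per series and the totals are exposed for a regular Prometheus scrape; all class and function names are hypothetical, not any existing product's API):

```python
# Hypothetical push-receiver counter aggregation: each short-lived job pushes
# its counter increments; the receiver sums them per series and renders the
# totals in Prometheus text exposition format for a normal pull/scrape.
from collections import defaultdict


class PushReceiver:
    def __init__(self):
        # (metric name, sorted label pairs) -> running total
        self.counters = defaultdict(float)

    def push(self, name, labels, value):
        """Aggregate a pushed counter increment by summing (the default policy)."""
        key = (name, tuple(sorted(labels.items())))
        self.counters[key] += value

    def render(self):
        """Render the aggregated state in Prometheus text exposition format."""
        lines = []
        for (name, labels), total in sorted(self.counters.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {total}")
        return "\n".join(lines)


# Two invocations of the same function, each pushing its own request count:
r = PushReceiver()
r.push("requests_total", {"func": "hello"}, 3)
r.push("requests_total", {"func": "hello"}, 2)
print(r.render())  # requests_total{func="hello"} 5.0
```

A real receiver would also have to handle the other questions (drop vs. backpressure, staleness, series expiry); this only shows the sum-on-push semantics Colin describes.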
>>> /MR
>>>
>>> On Sat, 27 Nov 2021, 12:50 Rob Skillington, <[email protected]> wrote:
>>>
>>>> Here's the documentation for using M3 Coordinator (with or without M3 Aggregator) with a backend that has a Prometheus Remote Write receiver: https://m3db.io/docs/how_to/any_remote_storage/
>>>>
>>>> Would be more than happy to do a call some time on this topic. The more we've looked at this, the clearer it is that this is primarily a client library issue, way before you consider the backend/receiver aspect. For the latter there are options out there, and they are fairly mechanical to overcome, vs. the client library concerns, which have a lot of ergonomic and practical issues, especially in a serverless environment where you may need to wait for publishing before finishing your request. Perhaps an async process is ideal: publishing a message to a local serverless message queue like SQS, and having a reader consume that and use another client library to push the data out. It would be more type-safe and probably less lossy than writing logs, reading them back, and then publishing, but it would need good client library support for both the serverless producers and the readers/pushers.
>>>>
>>>> Rob
>>>>
>>>> On Sat, Nov 27, 2021 at 1:41 AM Rob Skillington <[email protected]> wrote:
>>>>
>>>>> FWIW, we have been experimenting with users pushing OpenMetrics protobuf payloads quite successfully, but only sophisticated exporters that can guarantee no collisions of time series, generate their own monotonic counters, etc. are using this at this time.
>>>>>
>>>>> If you're looking for a solution that also involves aggregation support, M3 Coordinator (either standalone or combined with M3 Aggregator) supports Remote Write as a backend (and is thus compatible with Thanos, Cortex, and of course Prometheus itself too, due to the PRW receiver).
>>>>> M3 Coordinator, however, does not have any nice support for publishing to it from a serverless environment (since the primary protocol it supports is Prometheus Remote Write, which has no metrics clients, etc., I would assume).
>>>>>
>>>>> Rob
>>>>>
>>>>> On Mon, Nov 15, 2021 at 9:54 PM Bartłomiej Płotka <[email protected]> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I would love to resurrect this thread. I think we are missing a good push-gateway-like product that would ideally live in Prometheus (repo/binary, or could be recommended by us) and convert events to metrics in a cheap way, because that is what this is about when we talk about short-lived containers and serverless functions. What's the latest, Rob? I would be interested in some call for this if that is still on the table. (:
>>>>>>
>>>>>> I think we have some new options on the table, like supporting OTel metrics as such a potential high-cardinality event push, given there are more and more clients for that API. Potentially the OTel Collector can work as such a "push gateway" proxy, but at this point it's extremely generic, so we might want to consider something more focused/efficient/easier to maintain. Let's see (: The other problem is that OTel metrics is yet another protocol. Users might want to use the push gateway API, remote write, or logs/traces as per @Tobias Schmidt's idea:
>>>>>>
>>>>>>> Another service "loggateway" (or otherwise named) would then stream the logs, aggregate them and either expose them on the common /metrics endpoint or push them with remote write right away to a Prometheus instance hosted somewhere (like Grafana Cloud).
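The "loggateway" idea quoted above can be sketched very roughly as follows. This is a hedged illustration only (the JSON line format, field names, and function name are all made up for the example, not part of any proposal); the point is just that a reader streaming the runtime's log output can recover metric samples and aggregate them, mtail-style:

```python
# Illustrative "loggateway" core: functions emit one JSON metric sample per
# log line to stdout; a separate reader streams the log, keeps the counter
# totals, and would expose them on /metrics or forward them via remote write.
import json
from collections import defaultdict


def aggregate_metric_logs(lines):
    """Sum counter samples parsed from JSON log lines, skipping ordinary logs."""
    totals = defaultdict(float)
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # plain application log output, not a metric sample
        if record.get("type") == "counter":
            totals[record["name"]] += record["value"]
    return dict(totals)


# A mixed log stream: two metric samples interleaved with a normal log line.
log_stream = [
    '{"type": "counter", "name": "invocations_total", "value": 1}',
    "starting handler for request abc123",
    '{"type": "counter", "name": "invocations_total", "value": 1}',
]
print(aggregate_metric_logs(log_stream))  # {'invocations_total': 2.0}
```

This is essentially what mtail does for arbitrary log formats, as Björn points out later in the thread; a dedicated format just makes the parsing trivial.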
>>>>>> Kind Regards,
>>>>>> Bartek Płotka (@bwplotka)
>>>>>>
>>>>>> On Fri, Jun 25, 2021 at 6:11 AM Rob Skillington <[email protected]> wrote:
>>>>>>
>>>>>>> With respect to OpenMetrics push, we had something very similar at $prevco that pushed something that looked very much like the protobuf payload of OpenMetrics (but was a Thrift snapshot of an aggregated set of in-process metrics) and was used by short-running tasks (Jenkins, Flink jobs, etc.).
>>>>>>>
>>>>>>> I definitely agree it's not ideal, and ideally the platform provider can supply a collection point (there is a Jenkins plug-in that can do this, but custom metrics are very hard / nigh impossible to make work with it, and that is a non-cloud-provider environment that's actually possible to make work; just no one has made it seamless).
>>>>>>>
>>>>>>> I agree with Richi that something that could push to a Prometheus Agent-like target supporting OpenMetrics push could be a good middle ground, with the right support / guidelines:
>>>>>>> - A way to specify multiple Prometheus Agent targets and quickly fail over from one to another if one does not respond within $X ms (you could imagine a 5ms budget for each, with max 3 tried, introducing at worst 15ms overhead when all are down in 3 local availability zones, but in general this is a disaster case)
>>>>>>> - Deduplication, so that a retried push is not double counted; this might mean timestamping the metrics (so if a sample is written twice, only the first record is kept, etc.)
>>>>>>>
>>>>>>> I think it should, like the Push Gateway, generally be a last-resort kind of option with clear limitations, so that pull still remains the clear choice for anything but these environments.
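Rob's two guidelines above can be sketched in a few lines. This is a hedged, self-contained illustration (the function and class names, and the `send` callback signature, are invented for the example, not a real client library): try each agent target with a small per-attempt budget so failover stays within the stated worst case, and deduplicate retried pushes by keeping only the first write per (series, timestamp):

```python
# Sketch of failover-with-budget plus first-write-wins deduplication.


def push_with_failover(targets, payload, send, budget_ms=5, max_attempts=3):
    """Try up to max_attempts targets, giving each budget_ms before moving on.

    With 3 targets and a 5ms budget this bounds the overhead at 15ms even
    when every target is down (the disaster case Rob describes).
    """
    for target in targets[:max_attempts]:
        if send(target, payload, timeout_ms=budget_ms):
            return target  # first responsive target wins
    return None  # all attempts exhausted; caller drops or queues the payload


class DedupStore:
    """Keep only the first write per (series, timestamp) so retries don't double count."""

    def __init__(self):
        self.samples = {}

    def write(self, series, timestamp_ms, value):
        # setdefault ignores the write if that (series, timestamp) already exists.
        self.samples.setdefault((series, timestamp_ms), value)


store = DedupStore()
store.write("requests_total", 1700000000000, 5)
store.write("requests_total", 1700000000000, 5)  # retried push, ignored
print(store.samples)  # {('requests_total', 1700000000000): 5}
```

The dedup side only works if clients timestamp samples consistently across retries, which is exactly the caveat Rob raises ("this might mean timestamping the metrics").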
>>>>>>> Is there any interest in discussing this on a call some time?
>>>>>>>
>>>>>>> Rob
>>>>>>>
>>>>>>> On Thu, Jun 24, 2021 at 5:09 PM Bjoern Rabenstein <[email protected]> wrote:
>>>>>>>
>>>>>>>> On 22.06.21 11:26, Tobias Schmidt wrote:
>>>>>>>> > Last night I was wondering if there are any other common interfaces
>>>>>>>> > available in serverless environments and noticed that all products by AWS
>>>>>>>> > (Lambda) and GCP (Functions, Run) at least provide the option to handle log
>>>>>>>> > streams, sometimes even log files on disk. I'm currently thinking about
>>>>>>>> > experimenting with an approach where containers log metrics to stdout /
>>>>>>>> > some file, get picked up by the serverless runtime and written to some log
>>>>>>>> > stream. Another service "loggateway" (or otherwise named) would then stream
>>>>>>>> > the logs, aggregate them and either expose them on the common /metrics
>>>>>>>> > endpoint or push them with remote write right away to a Prometheus instance
>>>>>>>> > hosted somewhere (like Grafana Cloud).
>>>>>>>>
>>>>>>>> Perhaps I'm missing something, but isn't that https://github.com/google/mtail ?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Björn Rabenstein
>>>>>>>> [PGP-ID] 0x851C3DA17D748D03
>>>>>>>> [email] [email protected]
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/20210624210908.GB11559%40jahnn.

