Thanks for bringing up this topic, Bartek, and thanks for your great insights, Björn!

On Sat, Jun 19, 2021 at 12:16 AM Bjoern Rabenstein <[email protected]> wrote:
> On 15.06.21 20:59, Bartłomiej Płotka wrote:
> > Let's now talk about FaaS/Serverless.
>
> Excellent! That's my 2nd favorite topic after histograms. (And while I provably talked about histograms as my favorite topic since early 2015, I have only started to talk about FaaS/Serverless as an important gap to fill in the Prometheus story since 2018.)
>
> I think "true FaaS" means that the function calls are lightweight. The additional overhead of sending anything over the network defeats that purpose. So similar to what has been said before, and what Bartek has already nicely worked out, I think the metrics have to be managed by the FaaS runtime, in the same path as billing is managed.
>
> And that's, of course, what cloud providers are doing, and it's also a formidable way of locking their customers into their own metrics and monitoring system.
>
> And that's in turn precisely where I think Prometheus can use its weight. Prometheus has already proven that cloud providers essentially cannot get away with ignoring it, and even halfhearted integrations won't be enough. With more or less native Prometheus support by cloud providers, it might actually just require a small step to come to some convention on how to collect and present FaaS metrics in a "Promethean" way. If all cloud providers do it the same way, the lock-in is gone.
>
> I think it would be very valuable to study what OpenFaaS has already done: https://docs.openfaas.com/architecture/metrics/
>
> In the simplest case, we could just say: Please, dear cloud providers, please expose exactly the same metrics for general benefit. If there is anything to improve with the OpenFaaS approach, I'm sure they will be delighted to get help. (Spontaneously, I'm missing a way to define custom metrics, e.g. how many records a function call has processed.)

I think it's a great idea to open the discussion with the big cloud providers about an open runtime integration for metrics. Maybe they're more open about this than I expect. My fear is that this won't really lead to any substantial improvement, as the vendor lock-in seems to be quite desired, judging from my personal experience with cloud providers.

> > * Suggestion to use event aggregation proxy
> >   <https://github.com/weaveworks/prom-aggregation-gateway>
> > * Pushgateway improvements
> >   <https://groups.google.com/g/prometheus-users/c/sm5qOrsVY80/m/nSfbzHd9AgAJ>
> >   for serverless cases
>
> Despite all of what I said above, I think there _are_ quite a few users of FaaS who have fairly heavy-weight function calls. For them, pushing counter increments etc. via the network might actually be more convenient than funneling metrics through the FaaS runtime. This is then just another use case of the "distributed counter" idea, which the Pushgateway quite prominently is not catering for. As discussed in the thread linked above and at countless other places, I strongly recommend not shoehorning the Pushgateway into this use case, but creating a separate project for it, designed from the beginning for this use case. Perhaps weaveworks/prom-aggregation-gateway is just that. I haven't studied it in detail yet. In a way, we need "statsd done right". Again, I would suggest looking at what others have already done. For example, there are tons of statsd users out there. What have they done in the last years to overcome the known shortcomings? Perhaps statsd instrumentation and the Prometheus statsd exporter just need a bit of development in that direction to make it a viable solution.
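As a side note on how lightweight the client side of that push path already is: a statsd counter increment is a single UDP datagram in a trivial line format. Here's a minimal Go sketch; "statsd-exporter:9125" is a placeholder address for a Prometheus statsd_exporter listening on its default statsd port:

package main

import (
	"fmt"
	"net"
)

// incCounter sends one statsd counter increment as a single UDP datagram.
// The statsd line protocol for a counter is simply "<name>:<value>|c".
func incCounter(conn net.Conn, name string, inc int) error {
	_, err := fmt.Fprintf(conn, "%s:%d|c", name, inc)
	return err
}

func main() {
	// Placeholder address; 9125 is the statsd_exporter's default
	// statsd listening port.
	conn, err := net.Dial("udp", "statsd-exporter:9125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	if err := incCounter(conn, "records_processed", 1); err != nil {
		panic(err)
	}
}

The statsd_exporter would then aggregate those increments and expose them on its own /metrics endpoint for Prometheus to scrape, so the hard part isn't client overhead but operating and protecting that middle tier reliably.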
First of all, I wonder whether there is really any difference between serverless/FaaS and traditional deployment styles in terms of this heavy-weight/light-weight classification. Personally, the reason I chose a serverless runtime (GCP Cloud Run) for my application layer was simply to focus on business feature development: the runtime manages container lifecycles, and we only pay for the time containers serve traffic. I could deploy the exact same Docker container outside a serverless environment as well. My needs are still the same, though: I want to instrument the various aspects of the service and its many endpoints, both with common request-related metrics and with custom metrics. The problem I face is the fundamental mismatch between Prometheus' pull architecture and a serverless runtime that doesn't even let me see individual container instances.

The StatsD / push-over-network approach has some serious latency impact, as you both highlighted already. Additionally, it requires the deployment of a service with an external TCP API which would also need to be protected from public access (which might be easy, depending on the serverless runtime provider).

Last night I was wondering whether there are any other common interfaces available in serverless environments, and noticed that the products by AWS (Lambda) and GCP (Functions, Run) at least all provide the option to handle log streams, sometimes even log files on disk. I'm currently thinking about experimenting with an approach where containers log metrics to stdout / some file, which get picked up by the serverless runtime and written to some log stream. Another service, "loggateway" (or otherwise named), would then stream the logs, aggregate them, and either expose them on the common /metrics endpoint or push them via remote write right away to a Prometheus instance hosted somewhere (like Grafana Cloud). I've put a rough sketch of what I mean below.

My hope is that the latency impact of logging a dozen metrics per request should be negligible, especially compared to TCP pushing. There are a lot of open questions about the log format, how to handle metric metadata (without logging it all the time), and HA deployment of the log aggregation service. Furthermore, this approach requires some support by the client libraries (I think only the Ruby client supports custom data stores). Besides the implementation details, one major downside would be the pollution of the common log stream if the runtime provider doesn't support separate log streams (AWS Lambda only supports stdout/stderr, I think). Anything else I'm missing which would make this idea infeasible?
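Here is the promised sketch of both halves of the pipeline, the in-container emitter and a minimal loggateway. Everything in it is made up for illustration: the "prom_metric" marker, the JSON field names, reading from stdin instead of a real log stream API, and the counters-only aggregation.

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"sort"
	"strings"
	"sync"
	"time"
)

// metricLine is one hypothetical metric observation serialized as a
// structured log record. Field names and the "prom_metric" marker are
// placeholders, not a proposed standard.
type metricLine struct {
	Marker string            `json:"log_type"` // lets the gateway tell metric lines from ordinary logs
	Name   string            `json:"name"`
	Value  float64           `json:"value"` // a counter increment
	Labels map[string]string `json:"labels,omitempty"`
	TS     int64             `json:"ts_ms"`
}

// emit is the application side: it writes one metric record to stdout,
// which the serverless runtime ships to its log stream. It only lives in
// the same file as the gateway here for compactness.
func emit(name string, labels map[string]string, inc float64) {
	b, _ := json.Marshal(metricLine{
		Marker: "prom_metric", Name: name, Value: inc,
		Labels: labels, TS: time.Now().UnixMilli(),
	})
	fmt.Println(string(b))
}

// seriesKey renders the name plus sorted labels deterministically, so
// increments for the same series land in the same aggregation bucket.
func seriesKey(m metricLine) string {
	keys := make([]string, 0, len(m.Labels))
	for k := range m.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var sb strings.Builder
	sb.WriteString(m.Name)
	sb.WriteByte('{')
	for i, k := range keys {
		if i > 0 {
			sb.WriteByte(',')
		}
		fmt.Fprintf(&sb, "%s=%q", k, m.Labels[k])
	}
	sb.WriteByte('}')
	return sb.String()
}

func main() {
	var (
		mu   sync.Mutex
		sums = map[string]float64{} // running counter totals per series
	)

	// Gateway side: read log lines from stdin for simplicity; a real
	// loggateway would tail the provider's log stream API instead.
	go func() {
		sc := bufio.NewScanner(os.Stdin)
		for sc.Scan() {
			var m metricLine
			if err := json.Unmarshal(sc.Bytes(), &m); err != nil || m.Marker != "prom_metric" {
				continue // not a metric line; leave it to the normal log pipeline
			}
			mu.Lock()
			sums[seriesKey(m)] += m.Value // counters are just summed increments
			mu.Unlock()
		}
	}()

	// Expose the aggregate in Prometheus text exposition format, ready to
	// be scraped (or it could be converted to remote write instead).
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		for series, v := range sums {
			fmt.Fprintf(w, "%s %g\n", series, v)
		}
	})
	log.Fatal(http.ListenAndServe(":9091", nil))
}

Counters are the friendly case here because summing increments is order-independent, which might also help with HA; histograms could work the same way per bucket, while gauges and metadata (HELP/TYPE) belong to the open questions I mentioned above.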
> > I think the main problem appears if those FaaS runtimes are short-lived workloads that automatically spin up only to run some functions (batch jobs). In some way, this is then a problem of short-lived jobs and the design of those workloads.
> >
> > For those short-lived jobs, we again see users try to use the push model. I think there is room to either streamline those initiatives OR propose an alternative. A quick idea, yolo... why not kill the job after the first successful scrape (detecting usage on the /metrics path)?
>
> Ugh, that doesn't sound right. I think this problem should be solved within the FaaS runtime in the way they prefer. Cloud providers need billing in any case (they want to make money after all), so they have already solved reliable metrics collection for that. They just need to hook in a simple exporter to present Prometheus metrics. See how OpenFaaS has done it. Knative seems to have gone down the OTel path, but that could be seen as an implementation detail. If they in the end expose a /metrics endpoint with the desired metrics for Prometheus to scrape, all is good. It's just a terribly overengineered exporter, effectively. (o;
>
> --
> Björn Rabenstein
> [PGP-ID] 0x851C3DA17D748D03
> [email] [email protected]

