Thanks for bringing up this topic, Bartek, and thanks for your great insights, Björn!

On Sat, Jun 19, 2021 at 12:16 AM Bjoern Rabenstein <[email protected]> wrote:
> On 15.06.21 20:59, Bartłomiej Płotka wrote:
> > Let's now talk about FaaS/Serverless.
>
> Excellent! That's my 2nd favorite topic after histograms. (And while I provably talked about histograms as my favorite topic since early 2015, I have only started to talk about FaaS/Serverless as an important gap to fill in the Prometheus story since 2018.)
>
> I think "true FaaS" means that the function calls are lightweight. The additional overhead of sending anything over the network defeats that purpose. So similar to what has been said before, and what Bartek has already nicely worked out, I think the metrics have to be managed by the FaaS runtime, in the same path as billing is managed.
>
> And that's, of course, what cloud providers are doing, and it's also a formidable way of locking their customers into their own metrics and monitoring system.
>
> And that's in turn precisely where I think Prometheus can use its weight. Prometheus has already proven that cloud providers essentially cannot get away with ignoring it, and even halfhearted integrations won't be enough. With more or less native Prometheus support by cloud providers, it might actually just require a small step to come to some convention on how to collect and present FaaS metrics in a "Promethean" way. If all cloud providers do it the same way, the lock-in is gone.
>
> I think it would be very valuable to study what OpenFaaS has already done: https://docs.openfaas.com/architecture/metrics/
>
> In the simplest case, we could just say: Please, dear cloud providers, please expose exactly the same metrics for general benefit. If there is anything to improve with the OpenFaaS approach, I'm sure they will be delighted to get help. (Spontaneously, I'm missing a way to define custom metrics, e.g. how many records a function call has processed.)

I think it's a great idea to open the discussion with the big cloud providers about an open runtime integration for metrics. Maybe they're more open about this than I expect. My fear is that this won't really lead to any substantial improvement, as the vendor lock-in seems to be quite desired, judging from my personal experience with cloud providers.

> > * Suggestion to use event aggregation proxy
> >   <https://github.com/weaveworks/prom-aggregation-gateway>
> > * Pushgateway improvements
> >   <https://groups.google.com/g/prometheus-users/c/sm5qOrsVY80/m/nSfbzHd9AgAJ>
> >   for serverless cases
>
> Despite all of what I said above, I think there _are_ quite a few users of FaaS who have fairly heavy-weight function calls. For them, pushing counter increments etc. via the network might actually be more convenient than funneling metrics through the FaaS runtime. This is then just another use case of the "distributed counter" idea, which the Pushgateway quite prominently is not catering for. As discussed in the thread linked above and at countless other places, I strongly recommend not shoehorning the Pushgateway into this use case, but creating a separate project for it, designed from the beginning for this use case. Perhaps weaveworks/prom-aggregation-gateway is just that. I haven't studied it in detail yet. In a way, we need "statsd done right". Again, I would suggest looking at what others have already done. For example, there are tons of statsd users out there. What have they done in the last years to overcome the known shortcomings? Perhaps statsd instrumentation and the Prometheus statsd exporter just need a bit of development in that direction to make it a viable solution.
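As a side note on how lightweight the client side of that push path already is: a statsd counter increment is a single UDP datagram in a trivial line format. Here's a minimal Go sketch; "statsd-exporter:9125" is a placeholder address for a Prometheus statsd_exporter listening on its default statsd port:

package main

import (
	"fmt"
	"net"
)

// incCounter sends one statsd counter increment as a single UDP datagram.
// The statsd line protocol for a counter is simply "<name>:<value>|c".
func incCounter(conn net.Conn, name string, inc int) error {
	_, err := fmt.Fprintf(conn, "%s:%d|c", name, inc)
	return err
}

func main() {
	// Placeholder address; 9125 is the statsd_exporter's default
	// statsd listening port.
	conn, err := net.Dial("udp", "statsd-exporter:9125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	if err := incCounter(conn, "records_processed", 1); err != nil {
		panic(err)
	}
}

The statsd_exporter would then aggregate those increments and expose them on its own /metrics endpoint for Prometheus to scrape, so the hard part isn't client overhead but operating and protecting that middle tier reliably.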
First of all, I wonder whether there is really any difference between serverless/FaaS and traditional deployment styles in terms of this heavy-weight/light-weight classification. Personally, the reason I chose a serverless runtime (GCP Cloud Run) for my application layer was simply to focus on business feature development: the runtime manages container lifecycles, and we only pay for the time containers serve traffic. I could deploy the exact same Docker container outside a serverless environment as well. My needs are still the same, though: I want to instrument the various aspects of the service and its many endpoints, both with common request-related metrics and with custom metrics. The problem I face is the fundamental mismatch between Prometheus' pull architecture and a serverless runtime that doesn't even let me see individual container instances.

The StatsD / push-over-network approach has some serious latency impact, as you both highlighted already. Additionally, it requires the deployment of a service with an external TCP API which would also need to be protected from public access (which might be easy, depending on the serverless runtime provider).

Last night I was wondering whether there are any other common interfaces available in serverless environments, and noticed that the products by AWS (Lambda) and GCP (Functions, Run) at least all provide the option to handle log streams, sometimes even log files on disk. I'm currently thinking about experimenting with an approach where containers log metrics to stdout / some file, which get picked up by the serverless runtime and written to some log stream. Another service, "loggateway" (or otherwise named), would then stream the logs, aggregate them, and either expose them on the common /metrics endpoint or push them via remote write right away to a Prometheus instance hosted somewhere (like Grafana Cloud). I've put a rough sketch of what I mean below.

My hope is that the latency impact of logging a dozen metrics per request should be negligible, especially compared to TCP pushing. There are a lot of open questions about the log format, how to handle metric metadata (without logging it all the time), and HA deployment of the log aggregation service. Furthermore, this approach requires some support by the client libraries (I think only the Ruby client supports custom data stores). Besides the implementation details, one major downside would be the pollution of the common log stream if the runtime provider doesn't support separate log streams (AWS Lambda only supports stdout/stderr, I think). Anything else I'm missing which would make this idea infeasible?
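Here is the promised sketch of both halves of the pipeline, the in-container emitter and a minimal loggateway. Everything in it is made up for illustration: the "prom_metric" marker, the JSON field names, reading from stdin instead of a real log stream API, and the counters-only aggregation.

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"sort"
	"strings"
	"sync"
	"time"
)

// metricLine is one hypothetical metric observation serialized as a
// structured log record. Field names and the "prom_metric" marker are
// placeholders, not a proposed standard.
type metricLine struct {
	Marker string            `json:"log_type"` // lets the gateway tell metric lines from ordinary logs
	Name   string            `json:"name"`
	Value  float64           `json:"value"` // a counter increment
	Labels map[string]string `json:"labels,omitempty"`
	TS     int64             `json:"ts_ms"`
}

// emit is the application side: it writes one metric record to stdout,
// which the serverless runtime ships to its log stream. It only lives in
// the same file as the gateway here for compactness.
func emit(name string, labels map[string]string, inc float64) {
	b, _ := json.Marshal(metricLine{
		Marker: "prom_metric", Name: name, Value: inc,
		Labels: labels, TS: time.Now().UnixMilli(),
	})
	fmt.Println(string(b))
}

// seriesKey renders the name plus sorted labels deterministically, so
// increments for the same series land in the same aggregation bucket.
func seriesKey(m metricLine) string {
	keys := make([]string, 0, len(m.Labels))
	for k := range m.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var sb strings.Builder
	sb.WriteString(m.Name)
	sb.WriteByte('{')
	for i, k := range keys {
		if i > 0 {
			sb.WriteByte(',')
		}
		fmt.Fprintf(&sb, "%s=%q", k, m.Labels[k])
	}
	sb.WriteByte('}')
	return sb.String()
}

func main() {
	var (
		mu   sync.Mutex
		sums = map[string]float64{} // running counter totals per series
	)

	// Gateway side: read log lines from stdin for simplicity; a real
	// loggateway would tail the provider's log stream API instead.
	go func() {
		sc := bufio.NewScanner(os.Stdin)
		for sc.Scan() {
			var m metricLine
			if err := json.Unmarshal(sc.Bytes(), &m); err != nil || m.Marker != "prom_metric" {
				continue // not a metric line; leave it to the normal log pipeline
			}
			mu.Lock()
			sums[seriesKey(m)] += m.Value // counters are just summed increments
			mu.Unlock()
		}
	}()

	// Expose the aggregate in Prometheus text exposition format, ready to
	// be scraped (or it could be converted to remote write instead).
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		for series, v := range sums {
			fmt.Fprintf(w, "%s %g\n", series, v)
		}
	})
	log.Fatal(http.ListenAndServe(":9091", nil))
}

Counters are the friendly case here because summing increments is order-independent, which might also help with HA; histograms could work the same way per bucket, while gauges and metadata (HELP/TYPE) belong to the open questions I mentioned above.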
> > I think the main problem appears if those FaaS runtimes are short-lived workloads that automatically spin up only to run some functions (batch jobs). In some way, this is then a problem of short-lived jobs and the design of those workloads.
> >
> > For those short-lived jobs, we again see users try to use the push model. I think there is room to either streamline those initiatives OR propose an alternative. A quick idea, yolo... why not kill the job after the first successful scrape (detecting usage on the /metrics path)?
>
> Ugh, that doesn't sound right. I think this problem should be solved within the FaaS runtime in the way they prefer. Cloud providers need billing in any case (they want to make money after all), so they have already solved reliable metrics collection for that. They just need to hook in a simple exporter to present Prometheus metrics. See how OpenFaaS has done it. Knative seems to have gone down the OTel path, but that could be seen as an implementation detail. If they in the end expose a /metrics endpoint with the desired metrics for Prometheus to scrape, all is good. It's just a terribly overengineered exporter, effectively. (o;
>
> --
> Björn Rabenstein
> [PGP-ID] 0x851C3DA17D748D03
> [email] [email protected]

