Hi All, Prometheus has seen trends shift from on-premises to cloud, monoliths to microservices, virtual machines to containers, etc. Prometheus has proven successful for users in all of those scenarios. Let's now talk about FaaS/serverless. (Let's leave other buzzwords like blockchain/AI for later 🙈).
I would love to start a discussion around the usage of Prometheus metrics in serverless environments. I wonder if, from the Prometheus dev point of view, we can implement/integrate anything better, document or explain more, etc. (:

*In this thread, I am specifically looking for:*

* Existing best practices for using Prometheus to gather metrics from serverless/FaaS platforms and functions
* Specific gaps and limitations users might hit in these scenarios
* Existing success stories
* Ideas for improvements

*Action item: feel free to respond if you have any input on those!*

Past discussions:

* Fair suggestion to use cloud exporters for FaaS cases <https://groups.google.com/g/prometheus-users/c/_WSnJtn_TMw/m/f9Dh2cRkAQAJ>
* Suggestion to use an event aggregation proxy <https://github.com/weaveworks/prom-aggregation-gateway>
* Pushgateway improvements <https://groups.google.com/g/prometheus-users/c/sm5qOrsVY80/m/nSfbzHd9AgAJ> for serverless cases

*My thoughts:*

IMO a FaaS function should behave like a function in any other full-fledged application/pod: you programmatically increment a common metric that feeds your aggregated view (e.g. the overall number of errors). Switching to a push model for this case sounds like an unnecessary complication because, in the end, those functions run in a common, longer-living context (e.g. the FaaS runtime). That runtime should offer programmatic APIs for custom metrics, just like a normal app where your function has local variables (e.g. a *prometheus.CounterVec) to use. In fact, this is what AWS Lambda allows <https://aws.amazon.com/blogs/compute/operating-lambda-logging-and-custom-metrics/> and there are exporters to get that data <https://sysdig.com/blog/monitor-aws-lambda-prometheus/> into Prometheus. We see users attempting to switch to the push model; I just wonder if this really makes sense for FaaS functions.
If you init a TCP connection and use remote write, OM push, the Pushgateway API, or OTel/OpenCensus to push metrics, you take an enormous latency hit spinning up a new TCP connection just for that. This might already be too slow for FaaS. If you do this asynchronously on the FaaS platform, you need to care about discovery/backoffs/persistent buffers/auth and all the pains of the push model, plus some aggregation proxy like the Pushgateway, the aggregation gateway, or the OTel Collector to get this data into Prometheus (BTW, this is what Knative is recommending <https://knative.dev/docs/install/collecting-metrics/>). Equally, one could just expose those metrics on a /metrics endpoint and drop all of this complexity (or run an exporter if the FaaS is in the cloud, like Lambda/Google Run).

I think the main problem appears when those FaaS runtimes are short-lived workloads that automatically spin up only to run some functions (batch jobs). In some way, this is then a problem of short-lived jobs and the design of those workloads. For those short-lived jobs we again see users try the push model. I think there is room to either streamline those initiatives OR propose an alternative. A quick idea, yolo... why not kill the job after the first successful scrape (detecting usage of the /metrics path)?

Kind Regards,
Bartek Płotka (@bwplotka)

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/CAMssQwasGdCYMCvbh21vgcdZWiNz0FhVfNzdPQLruHBQVKEWrw%40mail.gmail.com.

