Hi All,

Prometheus has seen fashions shift from on-premise to cloud, from
monoliths to microservices, from virtual machines to containers, etc.
Prometheus has proven successful for users in all of those scenarios. Let's
now talk about FaaS/Serverless. (Let's leave the other buzzwords -
blockchain/AI - for later 🙈).

I would love to start a discussion around the usage of Prometheus metrics
in serverless environments. I wonder if, from the Prometheus dev point of
view, we can implement or integrate anything better, document or explain
more, etc. (:

*In this thread, I am specifically looking for:*

* Existing best practices for using Prometheus to gather metrics from
serverless/FaaS platforms and functions
* Specific gaps and limitations users might have in these scenarios
* Existing success stories
* Ideas for improvements

*Action Item: Feel free to respond if you have any input on those!*

Past discussions:
* Fair suggestion to use cloud exporters for FaaS cases
<https://groups.google.com/g/prometheus-users/c/_WSnJtn_TMw/m/f9Dh2cRkAQAJ>
* Suggestion to use event aggregation proxy
<https://github.com/weaveworks/prom-aggregation-gateway>
* Pushgateway improvements
<https://groups.google.com/g/prometheus-users/c/sm5qOrsVY80/m/nSfbzHd9AgAJ> for
serverless cases

*My thoughts:*

IMO a FaaS function should behave like a function in any other full-fledged
application/pod: you programmatically increment a common metric for your
aggregated view (e.g. the overall number of errors).

Trying to switch to a push model for this case sounds like an unnecessary
complication because, in the end, those functions run in a common,
longer-lived context (e.g. the FaaS runtime). This runtime should offer
programmatic APIs for custom metrics, just as it's possible in a normal
app where your function has local variables (e.g. a *prometheus.CounterVec)
to use.

In fact, this is what AWS Lambda allows
<https://aws.amazon.com/blogs/compute/operating-lambda-logging-and-custom-metrics/>
and there are exporters to get that data into Prometheus
<https://sysdig.com/blog/monitor-aws-lambda-prometheus/>.

We see users attempting to switch to the push model. I just wonder if
this really makes sense for FaaS functions.

If you initiate a TCP connection and use remote write, OM push, the
Pushgateway API, or OTel/OpenCensus to push metrics, you take an enormous
latency hit to spin up a new TCP connection just for that. This might
already be too slow for FaaS. If you do this asynchronously on the FaaS
platform, you need to care about discovery/backoffs/persistent buffers/auth
and all the pains of the push model, plus some aggregation proxy like the
Pushgateway/Aggregation gateway or the OTel collector to get this data to
Prometheus (BTW this is what Knative recommends
<https://knative.dev/docs/install/collecting-metrics/>).
Equally, one could just expose those metrics on a /metrics endpoint and drop
all of this complexity (or run an exporter if the FaaS is in the cloud,
like Lambda/Google Cloud Run).

I think the main problem appears if those FaaS runtimes are short-lived
workloads that automatically spin up only to run some function (e.g. batch
jobs). In a way, this is then the problem of short-lived jobs and the
design of those workloads.

For those short-lived jobs, we again see users trying to use the push
model. I think there is room to either streamline those initiatives OR
propose an alternative. A quick idea, yolo... why not kill the job after
the first successful scrape (detected by a hit on the /metrics path)?

Kind Regards,
Bartek Płotka (@bwplotka)
