[prometheus-developers] python client support for z-pages

'Brian Horakh' via Prometheus Developers Mon, 04 Sep 2023 08:14:25 -0700

I've opened a feature request (issue) that I plan to implement for my org
and submit upstream to the Prometheus client for Python.

The link is here:
https://github.com/prometheus/client_python/issues/953

The proposal is to equip the Prometheus client with the ability to
concurrently serve a /healthz page that returns HTTP 200 and can be *easily*
integrated into control planes (e.g. AWS ECS and K8s both use z-pages).

Control planes use z-pages (pioneered by Google, but widely adopted by most
load balancers) to determine whether an application is alive and functioning
properly, based on the HTTP response code. If an application fails to
return an HTTP 200 after a configured number of intervals, the container is
terminated and a new container is spun up. The Python client for
Prometheus has its own webserver internally, so I'm proposing to implement
z-pages capability in the client.

To be clear: I'm NOT proposing adding functionality to Prometheus. As far
as Prometheus core is concerned, these are ordinary "Counters", nothing
special.

GitHub user roidelapluie <https://github.com/roidelapluie> asked me to
submit my proposal to the broader Prometheus community for discussion and
feedback, which is welcome!

My proposed design:
The current Prometheus Counter implementation is the superclass, and
the proposed "HealthzCounter" will fully inherit all of its capabilities and
behaviors.

Application developers implementing east/west telemetry in their applications
can then place HealthzCounters at one or more critical points in the
codepath to track whether an application is working (i.e. is the main loop
running or blocked?).
For applications using Python asyncio, it would be appropriate to
implement one HealthzCounter per critical event loop.
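
As a rough illustration of the "one counter per event loop" idea, here is a
minimal stdlib-only sketch. The Heartbeat class, its max_silence_s
parameter and the main_loop/demo functions are hypothetical stand-ins
invented for this example; it does not use prometheus_client and is not the
proposed HealthzCounter implementation.

```python
import asyncio
import time


class Heartbeat:
    """Hypothetical stand-in for the proposed HealthzCounter."""

    def __init__(self, name, max_silence_s, clock=time.monotonic):
        self.name = name
        self.max_silence_s = max_silence_s  # assumed knob, not in the proposal
        self._clock = clock
        self.count = 0
        self._last_beat = clock()

    def inc(self):
        # Incrementing the counter doubles as the heartbeat.
        self.count += 1
        self._last_beat = self._clock()

    def is_alive(self):
        # Alive while the last increment is recent enough.
        return (self._clock() - self._last_beat) <= self.max_silence_s


async def main_loop(heartbeat, iterations, work_s):
    """A critical loop that beats once per completed iteration."""
    for _ in range(iterations):
        await asyncio.sleep(work_s)  # placeholder for real work
        heartbeat.inc()


async def demo():
    hb = Heartbeat("main_loop", max_silence_s=0.2)
    await main_loop(hb, iterations=3, work_s=0.01)
    alive_while_running = hb.is_alive()
    await asyncio.sleep(0.3)  # the loop has stopped, so no more beats
    alive_after_stall = hb.is_alive()
    return hb.count, alive_while_running, alive_after_stall
```

Running asyncio.run(demo()) exercises both states: the heartbeat is alive
right after the loop's last iteration and is no longer alive once the loop
has been silent for longer than max_silence_s.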

The expected behavior: *if*, for example, an MQ or DB client disconnects and
doesn't/can't reconnect, the application is running either an error path or
a success path, and either can be attributed to a plurality of potential
underlying reasons. Whatever the reason, the application can be easily
terminated in a standard way by the container orchestrator.

The intention is to place these counters into the critical codepath. The
HealthzCounter requires a heartbeat; if none arrives within an interval, it
trips a deadman switch. This would then cause the /healthz URL to return a
non-200 HTTP response, informing the cluster orchestration software to
perform a Roy from The IT Crowd solution ("hello IT dept., have you tried
turning it off and on again!?")
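
The deadman-switch behavior described above could be sketched roughly as
follows, using only the standard library. Everything here is an assumption
made for illustration: the DeadmanSwitch class, the timeout_s parameter, the
helper names, and the choice of 503 as the "non-200" status are not taken
from the proposal or from prometheus_client.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class DeadmanSwitch:
    """Trips when no heartbeat arrives within timeout_s (hypothetical)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        # Called from the critical codepath on every successful pass.
        self._last_beat = self._clock()

    def tripped(self):
        return (self._clock() - self._last_beat) > self.timeout_s


def healthz_status(switches):
    """Return 200 if no switch has tripped, otherwise 503 (assumed code)."""
    return 503 if any(s.tripped() for s in switches) else 200


def make_handler(switches):
    """Build a request handler class bound to the given switches."""

    class HealthzHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            status = healthz_status(switches)
            body = b"ok\n" if status == 200 else b"unhealthy\n"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the example quiet

    return HealthzHandler


def serve(switches, port=8080):
    """Serve /healthz until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("", port), make_handler(switches)).serve_forever()
```

An orchestrator probing /healthz would then see 200 while every switch keeps
receiving beat() calls, and 503 once any of them has been silent for longer
than its timeout_s.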

The business case:
My organization is in the process of implementing all our applications with
east/west metrics and I have gotten tentative approval to develop this
feature and upstream the work.

While it would be better to fix the bugs in the app, the reset itself is
often the first "routine" step in troubleshooting. The control plane will
keep a log of the resets, and because an application's counter is reset to
zero when it is restarted, tracking the entire maneuver is quite easy and
obvious in a tool like Grafana using PromQL.

I will be faster to respond on the GitHub issue, so if you don't mind,
please respond there with feedback or ideas. I will try to keep an eye on
this group as well for the next few days.

Also, since this is my first time posting to the Prometheus devs list, I
need to say: thank you for what you do & what you have done!!

Cheers,

-Brian Horakh
Software Engineer
Habitat.Energy Australia

