I've opened a feature request that I plan to implement for my org and 
submit upstream to the Prometheus client for Python.

The link is here:
https://github.com/prometheus/client_python/issues/953

The proposal is to equip the Prometheus client with the ability to 
concurrently serve a /healthz page that returns HTTP 200 and can be 
*easily* integrated into control planes (e.g. AWS ECS and K8s both use 
z-pages).

Control planes use z-pages (pioneered by Google, but widely adopted by most 
load balancers) to determine whether an application is alive and 
functioning properly, based on the HTTP response code. If an application 
fails to return an HTTP 200 for a configured number of check intervals, the 
container is terminated and a new container is spun up. The Python client 
for Prometheus has its own internal web server, so I'm proposing to 
implement z-pages capability in the client.

To be clear: I'm NOT proposing adding functionality to Prometheus. As far 
as Prometheus core is concerned, these are ordinary Counters, nothing 
special.

GitHub user roidelapluie <https://github.com/roidelapluie> asked me to 
submit my proposal to the broader Prometheus community for discussion and 
feedback, which is welcome!

My proposed design:
The current Prometheus Counter implementation is the superclass, and the 
proposed "HealthzCounter" will fully inherit all of its capabilities and 
behaviors.

Application developers implementing east/west telemetry in their 
applications can then place HealthzCounters at one or more critical points 
in the codepath to track whether an application is working (i.e. is the 
main loop running or blocked?).
For applications using Python asyncio, it would be appropriate to implement 
one HealthzCounter per critical event loop.
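
To illustrate the per-event-loop idea, here is a minimal sketch using only 
the standard library. LoopHeartbeat, main_loop, demo, and the 
max_silence_seconds parameter are hypothetical names for illustration; they 
are not the proposed HealthzCounter API and not part of prometheus_client.

```python
import asyncio
import time


class LoopHeartbeat:
    """Hypothetical stand-in for a per-event-loop HealthzCounter."""

    def __init__(self, max_silence_seconds):
        self.max_silence = max_silence_seconds
        self.beats = 0
        self.last_beat = time.monotonic()

    def inc(self):
        # Counting and heartbeating happen in the same call.
        self.beats += 1
        self.last_beat = time.monotonic()

    def is_alive(self):
        # Alive only if a beat arrived within the allowed silence window.
        return time.monotonic() - self.last_beat <= self.max_silence


async def main_loop(heartbeat, iterations, work):
    # One beat per completed iteration of the critical loop.
    for _ in range(iterations):
        await work()
        heartbeat.inc()


async def demo():
    hb = LoopHeartbeat(max_silence_seconds=0.2)

    async def work():
        await asyncio.sleep(0)  # placeholder for real work

    await main_loop(hb, iterations=3, work=work)
    alive_after_work = hb.is_alive()
    # The main loop has finished, so no further beats arrive.
    await asyncio.sleep(0.5)
    return hb.beats, alive_after_work, hb.is_alive()
```

Running asyncio.run(demo()) exercises the loop and then lets the heartbeat 
go stale, which is the condition a /healthz page would report.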

For example, *if* an MQ or DB client disconnects and doesn't/can't 
reconnect, the application is running either an error path or a success 
path, and either can be attributed to any number of underlying causes. 
Whatever the cause, the application can be easily terminated in a standard 
way by the container orchestrator.

The intention is to place these counters in the critical codepath. A 
HealthzCounter requires a heartbeat; if no heartbeat arrives within an 
interval, it trips a deadman switch. This causes the /healthz URL to return 
a non-200 HTTP status, informing the cluster orchestration software to 
perform a Roy from The IT Crowd solution ("hello IT dept., have you tried 
turning it off and on again!?").
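
As a rough sketch of the deadman switch described above, using only the 
standard library: Heartbeat, healthz_status, make_handler, and the choice 
of 503 as the non-200 status are all assumptions for illustration, not the 
proposed implementation or anything that exists in prometheus_client today.

```python
import time
from http.server import BaseHTTPRequestHandler


class Heartbeat:
    """Hypothetical deadman switch: goes stale if inc() stops being called."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self._max_silence = max_silence_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._last_beat = clock()
        self.count = 0

    def inc(self):
        self.count += 1
        self._last_beat = self._clock()

    def is_alive(self):
        return self._clock() - self._last_beat <= self._max_silence


def healthz_status(heartbeats):
    # 200 only while every heartbeat is fresh; 503 is an assumed choice
    # for the non-200 response.
    return 200 if all(hb.is_alive() for hb in heartbeats) else 503


def make_handler(heartbeats):
    class HealthzHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            status = healthz_status(heartbeats)
            body = b"ok\n" if status == 200 else b"stalled\n"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return HealthzHandler
```

The handler class returned by make_handler can be passed to 
http.server.ThreadingHTTPServer to serve /healthz on a port of your choice; 
an orchestrator health check would then poll that URL.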

The business case:
My organization is in the process of implementing all our applications with 
east/west metrics and I have gotten tentative approval to develop this 
feature and upstream the work.   

While it would be better to fix the bugs in the app, the reset itself is 
often the first "routine" step in troubleshooting. The control plane will 
keep a log of the resets, and because an application's counter is reset to 
zero when it is restarted, tracking the entire maneuver is quite easy and 
obvious in a tool like Grafana using PromQL.

I will be faster to respond on the GitHub issue, so if you don't mind, 
please respond there with feedback or ideas. I will also try to keep an eye 
on this group for the next few days.

Also, since this is my first time posting to the Prometheus devs list, I 
need to say: thank you for what you do & what you have done!!

Cheers,

-Brian Horakh
Software Engineer
Habitat.Energy Australia


