Hello Brian,

First of all, thank you for the proposal. My initial thought is that since this functionality would not be used by Prometheus or a Prometheus-related component, it is beyond the scope of a Prometheus client library. I do like the idea of being able to use metrics as a signal for the result of a /healthz endpoint though, so if it is challenging to get the current value of a metric, that is something I would consider improving.

Thanks again for the proposal, and I am curious what others think as well,
Chris

On Mon, Sep 4, 2023 at 9:14 AM 'Brian Horakh' via Prometheus Developers <[email protected]> wrote:

> I've opened a feature/issue that I plan to implement for my org and submit
> upstream to the Prometheus client for Python.
>
> The link is here:
> https://github.com/prometheus/client_python/issues/953
>
> The proposal is to equip the Prometheus client with the ability to
> concurrently respond with an HTTP 200 /healthz page that can be *easily*
> integrated into control planes (e.g. AWS ECS and K8s both use z-pages).
>
> Control planes use z-pages (pioneered by Google, but widely adopted by
> most load balancers) to determine if an application is alive/functioning
> properly based on the HTTP response code. If an application fails to
> return an HTTP 200 after a configured number of intervals, the container is
> terminated and a new container is spun up. The Python client for
> Prometheus has its own webserver internally, so I'm proposing to implement
> z-pages capability in the client.
>
> To be clear: I'm NOT proposing adding functionality to Prometheus. As
> far as Prometheus core is concerned, these are nothing special, just
> "Counters".
>
> GitHub user roidelapluie <https://github.com/roidelapluie> requested I
> submit my proposal to the broader Prometheus community for discussion &
> feedback, which is welcome!
>
> My proposed design:
> The current implementation of the Prometheus Counter is the superclass, and
> the proposed "HealthzCounter" will fully inherit all those capabilities &
> behaviors.
>
> Application developers implementing east/west telemetry in their
> applications can then place HealthzCounters at one or more critical points
> in the codepath to track if an application is working (i.e. is the main
> loop running or blocked).
> For applications using Python asyncio, it would be appropriate to
> implement one HealthzCounter per critical event loop.
>
> The behavior: *if*, for example, an MQ or DB client disconnects and
> doesn't/can't reconnect (it's either running an error path or a success
> path, and either can be attributed to a plurality of potential underlying
> reasons), the application can be easily terminated in a standard way by
> the container orchestrator.
>
> The intention is to place these counters into the critical codepath. The
> HealthzCounter requires a heartbeat, so after an interval without one it
> trips a deadman switch. This would then cause the /healthz URL to return a
> non-200 HTTP response, informing the cluster orchestration software to
> perform a Roy-from-The-IT-Crowd solution ("hello IT dept., have you tried
> turning it off and on again!?").
>
> The business case:
> My organization is in the process of implementing all our applications
> with east/west metrics, and I have gotten tentative approval to develop this
> feature and upstream the work.
>
> While it would be better to fix the bugs in the app, the reset itself is
> often the first "routine" step in troubleshooting. The control plane will
> keep a log of the resets, etc., and because an application's counter is
> also reset to zero when it is restarted, tracking the entire maneuver is
> quite easy and obvious in a tool like Grafana using PromQL.
>
> I will be faster to respond on the GitHub issue, if you don't mind
> responding there with feedback or ideas, but I will try to keep an eye on
> this group as well for the next few days.
>
> Also, since this is my first time posting to the Prometheus devs list, I
> need to say: thank you for what you do & what you have done!!
>
> Cheers,
>
> -Brian Horakh
> Software Engineer
> Habitat.Energy Australia
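
For readers following the thread, the heartbeat / deadman-switch behaviour described in the proposal could look roughly like the sketch below. It uses only the Python standard library and is not the prometheus_client API or the implementation from the linked issue. The names HeartbeatCounter, max_silence_seconds, healthz_status and serve_healthz, the port 8001, and the choice of HTTP 503 for the unhealthy case are all assumptions made for illustration (the proposal only says "non HTTP-200").

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HeartbeatCounter:
    """Hypothetical stand-in for the proposed HealthzCounter.

    It counts like a counter, and additionally remembers when it was last
    incremented so that a stale heartbeat can be detected.
    """

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self._max_silence = max_silence_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._lock = threading.Lock()
        self._value = 0
        self._last_beat = clock()  # the deadman timer starts at creation

    def inc(self, amount=1):
        """Increment the counter and record a heartbeat."""
        with self._lock:
            self._value += amount
            self._last_beat = self._clock()

    def value(self):
        with self._lock:
            return self._value

    def is_healthy(self):
        """True while the last heartbeat is recent enough."""
        with self._lock:
            return (self._clock() - self._last_beat) <= self._max_silence


def healthz_status(counters):
    """200 if every counter has a fresh heartbeat, otherwise 503 (assumed)."""
    return 200 if all(c.is_healthy() for c in counters) else 503


def make_handler(counters):
    class HealthzHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            status = healthz_status(counters)
            body = b"ok\n" if status == 200 else b"stale heartbeat\n"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep probe requests out of stderr

    return HealthzHandler


def serve_healthz(counters, port=8001):
    """Serve /healthz from a daemon thread; the port is an arbitrary choice."""
    server = ThreadingHTTPServer(("", port), make_handler(counters))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In an application, the main loop would call inc() on its HeartbeatCounter once per iteration. If the loop blocks for longer than max_silence_seconds, /healthz stops returning 200 and the orchestrator's probe can restart the container.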

