Hello Brian,

First of all, thank you for the proposal. My initial thought is that since
this functionality would not be used by Prometheus or a Prometheus-related
component, it is beyond the scope of a Prometheus client library. I do
like the idea of being able to use metrics as a signal for the result of a
/healthz endpoint though, so if it is challenging to get the current value
of a metric, that is something I would consider improving.
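
For reference, here is a minimal sketch of how a counter's current value
can be read today and mapped to a /healthz status code. The metric name
and the healthz_status helper are hypothetical and only for illustration.
get_sample_value is the existing registry method; as far as I know its
docstring describes it as inefficient and meant mainly for unit tests,
which may be part of what makes this use case awkward.

```python
from prometheus_client import CollectorRegistry, Counter

# A private registry keeps this example isolated from the global default.
registry = CollectorRegistry()

# Hypothetical metric name, chosen only for illustration.
main_loop_iterations = Counter(
    "main_loop_iterations",
    "Iterations completed by the main loop",
    registry=registry,
)


def healthz_status(registry, last_seen):
    """Return (http_status, current_value); 200 only if the counter advanced."""
    # Counter samples are exposed with a "_total" suffix.
    current = registry.get_sample_value("main_loop_iterations_total")
    if current is None or current <= last_seen:
        return 503, current
    return 200, current


main_loop_iterations.inc()
status, seen = healthz_status(registry, last_seen=0.0)
# No increments since the previous check, so the second check reports a stall.
stalled_status, _ = healthz_status(registry, last_seen=seen)
```

The caller has to remember the previously seen value between checks, so
a stalled counter only shows up on the next probe.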

Thanks again for the proposal and I am curious what others think as well,
Chris

On Mon, Sep 4, 2023 at 9:14 AM 'Brian Horakh' via Prometheus Developers <
[email protected]> wrote:

> I've opened a feature/issue that I plan to implement for my org and submit
> upstream to the Prometheus client for Python.
>
> The link is here:
> https://github.com/prometheus/client_python/issues/953
>
> The proposal is to equip the Prometheus client with the ability to
> concurrently respond with an HTTP 200 /healthz page that can be *easily*
> integrated into control planes (e.g. AWS ECS and K8s both use z-pages).
>
> Control planes use z-pages (pioneered by Google, but widely adopted by
> most load balancers) to determine if an application is alive/functioning
> properly based on the HTTP response code. If an application fails to
> return an HTTP 200 after a configured number of intervals, the container is
> terminated and a new container is spun up. The Python client for
> Prometheus has its own web server internally, so I'm proposing to implement
> z-pages capability in the client.
>
> To be clear: I'm NOT proposing adding functionality to Prometheus. As
> far as Prometheus core is concerned, these are nothing special, just
> "Counters".
>
> GitHub user roidelapluie <https://github.com/roidelapluie> requested that
> I submit my proposal to the broader Prometheus community for discussion &
> feedback, which is welcome!
>
> My proposed design:
> The current Prometheus Counter implementation is the superclass, and
> the proposed "HealthzCounter" will fully inherit all of its capabilities &
> behaviors.
>
> Application developers implementing east/west telemetry in their
> applications can then place HealthzCounters at one or more critical points
> in the codepath to track if an application is working (i.e. is the main
> loop running or blocked).
> For applications using Python asyncio, it would be appropriate to
> implement one HealthzCounter per critical event loop.
>
> The behavior: *if*, for example, an MQ or DB client disconnects and
> doesn't/can't reconnect, the application is running either an error path
> or a success path, and either can be attributed to a plurality of potential
> underlying reasons. The application can then be easily terminated in a
> standard way by the container orchestrator.
>
> The intention is to place these counters into the critical codepath. The
> HealthzCounter requires a heartbeat; if none arrives within an interval, it
> trips a deadman switch. This then causes the /healthz URL to return a
> non-200 HTTP response, informing the cluster orchestration software to
> perform a Roy from The IT Crowd solution ("hello IT dept., have you tried
> turning it off and on again!?")
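
The heartbeat and deadman switch described above could be sketched
roughly as follows. This is a stdlib-only illustration and not the
proposed HealthzCounter itself: the DeadmanSwitch class, its method
names, the 30 second timeout and the injectable clock are all
assumptions made up for the example.

```python
import time


class DeadmanSwitch:
    """Hypothetical helper: healthy only while heartbeats keep arriving."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._last_beat = clock()

    def beat(self):
        # Called from the critical codepath, e.g. once per main loop pass.
        self._last_beat = self._clock()

    def status_code(self):
        # What a /healthz handler would return for the current state.
        elapsed = self._clock() - self._last_beat
        return 200 if elapsed <= self._timeout else 503


# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
switch = DeadmanSwitch(timeout_seconds=30, clock=lambda: fake_now[0])

fake_now[0] = 10.0
healthy = switch.status_code()    # heartbeat is recent enough

fake_now[0] = 45.0
tripped = switch.status_code()    # no heartbeat for longer than the timeout

switch.beat()
recovered = switch.status_code()  # a new heartbeat resets the switch
```

In a real service the orchestrator would restart the container once
enough consecutive probes see the non-200 status.
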
>
> The business case:
> My organization is in the process of implementing all our applications
> with east/west metrics and I have gotten tentative approval to develop this
> feature and upstream the work.
>
> While it would be better to fix the bugs in the app, the reset itself is
> often the first "routine" step in troubleshooting. The control plane will
> keep a log of the resets, etc. When an application is restarted, its
> counter is also reset to zero, which makes tracking the entire maneuver
> quite easy and obvious in a tool like Grafana using PromQL.
>
> I will be faster to respond on the GitHub issue, if you don't mind
> responding there with feedback or ideas, but will try to keep an eye on
> this group as well for the next few days.
>
> Also, since this is my first time posting to the Prometheus devs list,
> I need to say: thank you for what you do & what you have done!!
>
> Cheers,
>
> -Brian Horakh
> Software Engineer
> Habitat.Energy Australia
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-developers/b9186e94-8433-4081-a87c-26479bbedbccn%40googlegroups.com
>
