Hello team,
One of the most severe issues affecting our real-time application is stuck
threads. They have several causes, such as long-held locks, deadlocks, and
threads that wait forever for a reply when a packet is dropped.
Stuck threads of this kind fly under the radar of the existing system health
check methods.
In mission-critical applications, they result in an immediate outage.
As a short-term measure, we are implementing an internal watchdog mechanism to
detect stuck threads:
There is a registration object.
A function executor has start/end hooks that register/unregister the thread
via the registration object.
A custom scheduled monitoring thread is spawned on startup. The thread wakes
up every N seconds, scans the registration map, and detects threads that have
stayed registered (that is, have not unregistered) for longer than a
configurable threshold.
Once such threads are detected, a stack dump of the process is taken and a
stuck-thread statistic metric is provided.
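
A minimal sketch of how such a watchdog could be put together is shown below.
It is an illustration only, not our production code: the class name
StuckThreadWatchdog, its method names, and the choice to print stack traces to
stderr are all assumptions, and nothing here is part of Geode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical class for illustration; not a Geode API.
final class StuckThreadWatchdog {
    // Registration map: thread -> System.nanoTime() at registration.
    private final Map<Thread, Long> registry = new ConcurrentHashMap<>();
    private final long thresholdNanos;
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "stuck-thread-watchdog");
            t.setDaemon(true); // must not keep the JVM alive
            return t;
        });

    StuckThreadWatchdog(long thresholdMillis) {
        this.thresholdNanos = TimeUnit.MILLISECONDS.toNanos(thresholdMillis);
    }

    // Start hook: called by the executor before the work begins.
    void register() {
        registry.put(Thread.currentThread(), System.nanoTime());
    }

    // End hook: called by the executor when the work finishes.
    void unregister() {
        registry.remove(Thread.currentThread());
    }

    // Wake up every periodSeconds and report long-registered threads.
    void start(long periodSeconds) {
        scheduler.scheduleAtFixedRate(
            () -> report(findStuck(System.nanoTime())),
            periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }

    // Threads that have stayed registered longer than the threshold at nowNanos.
    List<Thread> findStuck(long nowNanos) {
        List<Thread> stuck = new ArrayList<>();
        for (Map.Entry<Thread, Long> e : registry.entrySet()) {
            if (nowNanos - e.getValue() > thresholdNanos) {
                stuck.add(e.getKey());
            }
        }
        return stuck;
    }

    // Assumption: stack traces go to stderr; a real system would also
    // update a statistic here.
    private void report(List<Thread> stuck) {
        for (Thread t : stuck) {
            StringBuilder sb = new StringBuilder("Possibly stuck: " + t.getName());
            for (StackTraceElement el : t.getStackTrace()) {
                sb.append("\n\tat ").append(el);
            }
            System.err.println(sb);
        }
    }
}
```

Passing the current time into findStuck keeps the scan logic separate from
the scheduler, so the detection rule can be checked without waiting for the
threshold to elapse.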
This helps us monitor, detect, and quickly decide which action to take.
Usually the decision is to bounce the member (a consistency issue is
possible, but in our case that is better than a denial of service).
The above solution does not touch the Geode core code; it is implemented
entirely within our custom code.
I would like to propose introducing a long-term, generic thread monitoring
mechanism that detects threads that are stuck for any reason.
It would maintain a monitoring object with start/end methods, invoked
similarly to FunctionStats.startFunctionExecution and
FunctionStats.endFunctionExecution.
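
To make the proposed call pattern concrete, here is a rough sketch of what
such a monitoring object might look like from the caller's side. The names
ThreadMonitor, startExecution, endExecution, and monitor are hypothetical and
only illustrate the shape of the idea; they are not existing Geode APIs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical API sketch; none of these names exist in Geode.
final class ThreadMonitor {
    // Threads currently inside a monitored section -> start time in nanos.
    private final Map<Thread, Long> active = new ConcurrentHashMap<>();

    void startExecution() {
        active.put(Thread.currentThread(), System.nanoTime());
    }

    void endExecution() {
        active.remove(Thread.currentThread());
    }

    int activeCount() {
        return active.size();
    }

    // Convenience wrapper: the end hook runs even if the work throws,
    // so a failed call is never left looking like a stuck thread.
    <T> T monitor(Supplier<T> work) {
        startExecution();
        try {
            return work.get();
        } finally {
            endExecution();
        }
    }
}
```

The try/finally pairing matters for this design: if the end method were
skipped on an exception, the monitor would later report a thread as stuck
when it had in fact already failed and moved on.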
Your feedback would be appreciated.
Thank you for your cooperation.
Best regards!
Gregory Vortman