That's exactly the point: to have a single, very thin and generic mechanism that covers all threads and thread pools. Nothing in this solution is specific to Function threads.
Regards
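For illustration only, here is a minimal sketch of how one thin hook can cover any thread pool without being tied to Function threads. It uses plain java.util.concurrent rather than any Geode internals: a hypothetical MonitoredThreadPool records when each worker thread starts a task in ThreadPoolExecutor's beforeExecute hook and clears the entry in afterExecute, so any task type submitted to the pool is tracked the same way. The class and method names are assumptions, not existing Geode API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

// Hypothetical sketch, not a Geode class: a fixed-size pool that tracks
// how long each worker thread has been running its current task.
class MonitoredThreadPool extends ThreadPoolExecutor {
    // Worker thread -> System.nanoTime() at which its current task started.
    private final Map<Thread, Long> activeSince = new ConcurrentHashMap<>();

    public MonitoredThreadPool(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    protected void beforeExecute(Thread t, Runnable r) {
        super.beforeExecute(t, r);
        activeSince.put(t, System.nanoTime()); // "start" hook
    }

    @Override
    protected void afterExecute(Runnable r, Throwable t) {
        // afterExecute runs on the thread that executed the task.
        activeSince.remove(Thread.currentThread()); // "end" hook
        super.afterExecute(r, t);
    }

    /** Threads whose current task has been running longer than the threshold. */
    public List<Thread> threadsRunningLongerThan(long threshold, TimeUnit unit) {
        long now = System.nanoTime();
        long limit = unit.toNanos(threshold);
        return activeSince.entrySet().stream()
                .filter(e -> now - e.getValue() > limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Per task, the added cost in this sketch is one ConcurrentHashMap put and one remove; whether that is acceptable on hot put/get paths would still have to be measured.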
-----Original Message-----
From: Jason Huynh [jhu...@pivotal.io]
Received: Wednesday, 21 Feb 2018, 20:54
To: dev@geode.apache.org [dev@geode.apache.org]
CC: u...@geode.apache.org [u...@geode.apache.org]
Subject: Re: [Proposal] Thread monitoring mechanism

I am assuming this would be for all threads/thread pools and not specific to Function threads. I wonder what the impact would be on put/get operations, or are we going to target specific operations?

On Tue, Feb 20, 2018 at 1:04 AM Gregory Vortman <gregory.vort...@amdocs.com> wrote:

Hello team,

One of the most severe issues hitting our real-time application is stuck threads, which happen for multiple reasons: long-lasting locks, deadlocks, threads that wait forever for a reply when a packet is dropped, etc. Such stuck threads fly under the radar of the existing system health-check methods. In mission-critical applications, this results in an immediate outage.

As a short-term solution, we are implementing a kind of internal watchdog mechanism for stuck-thread detection. There is a registration object, and the Function executor has start/end hooks that register/unregister the thread via the registration object. A customized monitoring thread is scheduled on startup. It wakes up every N seconds, scans the registration map and detects threads that have stayed registered (i.e. were not unregistered) for longer than a configurable time. Once such threads have been detected, a stack dump of the process is taken and a thread stack statistic metric is provided.

This helps us monitor, detect and quickly decide what action should be taken. Usually it is a decision to bounce the member (a consistency issue is possible, but in our case that is better than a denial of service).

The above solution does not touch GEODE core code; it is implemented entirely within our customized code.

I would like to raise a proposal to introduce a long-term, generic thread monitoring mechanism to detect threads that are stuck for any reason.
The idea is to maintain a monitoring object with start/end methods, invoked similarly to FunctionStats.startFunctionExecution and FunctionStats.endFunctionExecution.

Your feedback would be appreciated.
Thank you for your cooperation.

Best regards!
Gregory Vortman

This message and the information contained herein is proprietary and confidential and subject to the Amdocs policy statement, you may review at https://www.amdocs.com/about/email-disclaimer
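A minimal sketch of the proposed watchdog, assuming plain Java and no Geode internals: threads call a start hook when work begins and an end hook when it finishes (in the spirit of FunctionStats.startFunctionExecution/endFunctionExecution), a scheduled scanner thread wakes up every scan interval, and any thread still registered after the configurable threshold is reported together with its current stack trace. ThreadMonitor, startMonitor, endMonitor, scan and the reporter callback are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical watchdog sketch; names are illustrative, not Geode API.
class ThreadMonitor implements AutoCloseable {
    // Registration map: thread -> System.nanoTime() at registration.
    private final Map<Thread, Long> registered = new ConcurrentHashMap<>();
    private final long stuckThresholdNanos;
    private final Consumer<String> reporter;
    private final ScheduledExecutorService scanner;

    public ThreadMonitor(long stuckThreshold, long scanInterval, TimeUnit unit,
                         Consumer<String> reporter) {
        this.stuckThresholdNanos = unit.toNanos(stuckThreshold);
        this.reporter = reporter;
        this.scanner = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "thread-monitor");
            t.setDaemon(true); // never keeps the JVM alive on its own
            return t;
        });
        // Catch runtime exceptions: an escaping exception would cancel
        // all further runs of a fixed-rate task.
        scanner.scheduleAtFixedRate(() -> {
            try { scan(); } catch (RuntimeException ignored) { /* keep scanning */ }
        }, scanInterval, scanInterval, unit);
    }

    /** Start hook: register the calling thread. */
    public void startMonitor() {
        registered.put(Thread.currentThread(), System.nanoTime());
    }

    /** End hook: unregister the calling thread. */
    public void endMonitor() {
        registered.remove(Thread.currentThread());
    }

    /** Reports every thread registered longer than the threshold, with its stack. */
    public List<String> scan() {
        long now = System.nanoTime();
        List<String> reports = new ArrayList<>();
        for (Map.Entry<Thread, Long> e : registered.entrySet()) {
            long elapsed = now - e.getValue();
            if (elapsed <= stuckThresholdNanos) {
                continue;
            }
            Thread t = e.getKey();
            StringBuilder sb = new StringBuilder()
                    .append("Thread ").append(t.getName())
                    .append(" registered for ").append(TimeUnit.NANOSECONDS.toMillis(elapsed))
                    .append(" ms, state ").append(t.getState()).append('\n');
            for (StackTraceElement frame : t.getStackTrace()) {
                sb.append("\tat ").append(frame).append('\n');
            }
            String report = sb.toString();
            reports.add(report);
            reporter.accept(report); // e.g. log it or feed a metric
        }
        return reports;
    }

    @Override
    public void close() {
        scanner.shutdownNow();
    }
}
```

In this sketch an executor would bracket each unit of work with monitor.startMonitor(); try { ... } finally { monitor.endMonitor(); }, so a thread that returns normally or throws is always unregistered, and only genuinely stuck threads remain in the map for the scanner to report.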