[ https://issues.apache.org/jira/browse/GEODE-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kirk Lund updated GEODE-8357:
-----------------------------
    Summary: Exhausting the high priority message thread pool can result in 
deadlock  (was: Exhausting the high priority message pool can result in 
deadlock)

> Exhausting the high priority message thread pool can result in deadlock
> -----------------------------------------------------------------------
>
>                 Key: GEODE-8357
>                 URL: https://issues.apache.org/jira/browse/GEODE-8357
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>    Affects Versions: 1.0.0-incubating, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 
> 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.11.0, 1.12.0
>            Reporter: Kirk Lund
>            Assignee: Kirk Lund
>            Priority: Major
>              Labels: GeodeOperationAPI
>
> The system property "DistributionManager.MAX_THREADS" defaults to 100:
> {noformat}
> int MAX_THREADS = Integer.getInteger("DistributionManager.MAX_THREADS", 100);
> {noformat}
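> Since the value is read with Integer.getInteger, it can be overridden with an ordinary JVM system property when the member is started. A hedged illustration (the value 500 is only an example, and the gfsh form assumes the server is started through gfsh):
> {noformat}
> # plain JVM argument
> -DDistributionManager.MAX_THREADS=500
> # or when starting a server through gfsh
> gfsh> start server --name=server1 --J=-DDistributionManager.MAX_THREADS=500
> {noformat}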
> The system property used to be defined in geode-core ClusterDistributionManager 
> and has since moved to geode-core OperationExecutors.
> The value limits both the ClusterOperationExecutors threadPool and its 
> highPriorityPool:
> {noformat}
> threadPool =
>     CoreLoggingExecutors.newThreadPoolWithFeedStatistics("Pooled Message Processor ",
>         thread -> stats.incProcessingThreadStarts(), this::doProcessingThread,
>         MAX_THREADS, stats.getNormalPoolHelper(), threadMonitor,
>         INCOMING_QUEUE_LIMIT, stats.getOverflowQueueHelper());
> highPriorityPool = CoreLoggingExecutors.newThreadPoolWithFeedStatistics(
>     "Pooled High Priority Message Processor ",
>     thread -> stats.incHighPriorityThreadStarts(), this::doHighPriorityThread,
>     MAX_THREADS, stats.getHighPriorityPoolHelper(), threadMonitor,
>     INCOMING_QUEUE_LIMIT, stats.getHighPriorityQueueHelper());
> {noformat}
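> Roughly speaking, each of these is a fixed-size pool fed from a bounded incoming queue, and both pools share the same MAX_THREADS cap. A minimal standalone sketch of that shape, using plain java.util.concurrent instead of Geode's CoreLoggingExecutors (the queue size here is illustrative, not Geode's INCOMING_QUEUE_LIMIT):
> {noformat}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.LinkedBlockingQueue;
> import java.util.concurrent.ThreadPoolExecutor;
> import java.util.concurrent.TimeUnit;
>
> class BoundedPoolSketch {
>   // Rough stand-in for one of the Geode pools: at most maxThreads workers,
>   // with additional tasks waiting in a bounded incoming queue.
>   static ExecutorService newBoundedPool() {
>     int maxThreads = Integer.getInteger("DistributionManager.MAX_THREADS", 100);
>     int incomingQueueLimit = 80_000; // illustrative size only
>     return new ThreadPoolExecutor(
>         maxThreads, maxThreads,
>         60, TimeUnit.SECONDS,
>         new LinkedBlockingQueue<>(incomingQueueLimit));
>   }
> }
> {noformat}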
> I have seen server startup hang when recovering lots of expired entries from 
> disk while using PDX. The hang looks like a dlock request for the PDX lock that 
> never receives a response. Checking the distributionStats#highPriorityQueueSize 
> statistic (in VSD) shows the value maxed out and never dropping.
> The dlock response granting the PDX lock is stuck in the highPriorityQueue 
> because no more high priority threads are available to process it. All of the 
> high priority thread stack dumps show tasks, such as recovering a bucket from 
> disk, blocked waiting for the PDX lock.
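> In other words, every worker in the bounded high priority pool is blocked on a lock whose grant is itself queued behind them. A minimal standalone sketch of the same deadlock shape, outside Geode (the pool size of 2 and the latch are illustrative stand-ins for MAX_THREADS and the PDX dlock grant):
> {noformat}
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> public class PoolExhaustionDeadlock {
>   public static void main(String[] args) {
>     int poolSize = 2; // stand-in for MAX_THREADS
>     ExecutorService pool = Executors.newFixedThreadPool(poolSize);
>     CountDownLatch pdxLock = new CountDownLatch(1); // stand-in for the PDX dlock grant
>
>     // Fill every worker with a task that blocks until the "lock" is granted.
>     for (int i = 0; i < poolSize; i++) {
>       pool.submit(() -> {
>         try {
>           pdxLock.await(); // blocks forever: the granting task below never runs
>         } catch (InterruptedException e) {
>           Thread.currentThread().interrupt();
>         }
>       });
>     }
>
>     // The "grant" sits in the queue behind the blocked tasks, so it never executes.
>     pool.submit(pdxLock::countDown);
>   }
> }
> {noformat}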
> Several changes could improve this situation, either individually or in 
> combination:
> # improve observability so that support can identify when this situation has 
> occurred
> # automatically detect this situation and warn the user with a log statement 
> (see the sketch after this list)
> # automatically prevent this situation
> # identify the messages that are prone to causing deadlocks and move them to 
> a dedicated thread pool with a higher limit
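> As a rough sketch of what option 2 could look like for a plain ThreadPoolExecutor (Geode's own pools and statistics would differ; the class name, the log text, and the 10-second period below are all made up for illustration):
> {noformat}
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.ThreadPoolExecutor;
> import java.util.concurrent.TimeUnit;
>
> class PoolSaturationWarner {
>   // Hypothetical watchdog: if every worker is busy and the queue is still
>   // non-empty, warn the operator that the pool may be deadlocked or undersized.
>   static void watch(ThreadPoolExecutor pool, ScheduledExecutorService scheduler) {
>     scheduler.scheduleAtFixedRate(() -> {
>       boolean saturated = pool.getActiveCount() >= pool.getMaximumPoolSize()
>           && !pool.getQueue().isEmpty();
>       if (saturated) {
>         System.err.println("WARNING: high priority pool saturated; "
>             + pool.getQueue().size() + " messages waiting");
>       }
>     }, 10, 10, TimeUnit.SECONDS);
>   }
> }
> {noformat}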



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
