[ https://issues.apache.org/jira/browse/GEODE-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kirk Lund updated GEODE-8357:
-----------------------------
    Summary: Exhausting the high priority message thread pool can result in deadlock  (was: Exhausting the high priority message pool can result in deadlock)

> Exhausting the high priority message thread pool can result in deadlock
> -----------------------------------------------------------------------
>
>                 Key: GEODE-8357
>                 URL: https://issues.apache.org/jira/browse/GEODE-8357
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>    Affects Versions: 1.0.0-incubating, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.11.0, 1.12.0
>            Reporter: Kirk Lund
>            Assignee: Kirk Lund
>            Priority: Major
>              Labels: GeodeOperationAPI
>
> The system property "DistributionManager.MAX_THREADS" defaults to 100:
> {noformat}
> int MAX_THREADS = Integer.getInteger("DistributionManager.MAX_THREADS", 100);
> {noformat}
> The system property used to be defined in geode-core
> ClusterDistributionManager and has moved to geode-core OperationExecutors.
> The value is used to limit the ClusterOperationExecutors threadPool and
> highPriorityPool:
> {noformat}
> threadPool =
>     CoreLoggingExecutors.newThreadPoolWithFeedStatistics("Pooled Message Processor ",
>         thread -> stats.incProcessingThreadStarts(), this::doProcessingThread,
>         MAX_THREADS, stats.getNormalPoolHelper(), threadMonitor,
>         INCOMING_QUEUE_LIMIT, stats.getOverflowQueueHelper());
> highPriorityPool = CoreLoggingExecutors.newThreadPoolWithFeedStatistics(
>     "Pooled High Priority Message Processor ",
>     thread -> stats.incHighPriorityThreadStarts(), this::doHighPriorityThread,
>     MAX_THREADS, stats.getHighPriorityPoolHelper(), threadMonitor,
>     INCOMING_QUEUE_LIMIT, stats.getHighPriorityQueueHelper());
> {noformat}
> I have seen server startup hang when recovering lots of expired entries from
> disk while using PDX. The hang looks like a dlock request for the PDX lock is
> not receiving a response.
> Checking the value of the distributionStats#highPriorityQueueSize statistic
> (in VSD) shows the value maxed out and never dropping.
> The dlock response granting the PDX lock is stuck in the highPriorityQueue
> because there are no more highPriorityPool threads available to process the
> response. All of the high priority thread stack dumps show that tasks such as
> recovering a bucket from disk are blocked waiting for the PDX lock.
> Several changes could improve this situation, either in conjunction or
> individually:
> # improve observability to enable support to identify that this situation has
> occurred
> # automatically identify this situation and warn the user with a log statement
> # automatically prevent this situation
> # identify the messages that are prone to causing deadlocks and move them to
> a dedicated thread pool with a higher limit



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
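
The hang described in the issue can be illustrated outside Geode with a minimal sketch. It is not Geode code: the class PoolExhaustionDemo, the method grantDelivered, and the use of Executors.newFixedThreadPool in place of CoreLoggingExecutors.newThreadPoolWithFeedStatistics are all assumptions made for illustration, and a CountDownLatch stands in for the PDX dlock. Each blocked task waits for a "lock granted" signal, and the task that delivers that signal is submitted to the same bounded pool. Once every thread is occupied by a waiter, the grant sits in the queue and is never processed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolExhaustionDemo {

  /**
   * Hypothetical helper: fills a fixed-size pool with tasks that wait for a
   * grant, then submits the grant to the same pool. Returns true if the grant
   * was processed within the timeout, false if it stayed stuck in the queue.
   */
  public static boolean grantDelivered(int maxThreads, int blockedTasks, long timeoutMillis)
      throws InterruptedException {
    // Stand-in for highPriorityPool; the real pool is built by CoreLoggingExecutors.
    ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
    CountDownLatch started = new CountDownLatch(Math.min(maxThreads, blockedTasks));
    CountDownLatch lockGranted = new CountDownLatch(1);
    try {
      for (int i = 0; i < blockedTasks; i++) {
        pool.execute(() -> {
          started.countDown();
          try {
            // Stands in for a task (e.g. bucket recovery) waiting on the PDX lock.
            lockGranted.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // Wait until the waiting tasks actually occupy pool threads.
      started.await();
      // The "grant response" is queued behind the waiters in the same pool.
      pool.execute(lockGranted::countDown);
      return lockGranted.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } finally {
      // Interrupt any workers still blocked so the demo terminates.
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // One spare thread remains, so the grant is processed.
    System.out.println("spare thread:   " + grantDelivered(4, 3, 500));
    // Every thread is blocked, so the grant times out in the queue.
    System.out.println("pool exhausted: " + grantDelivered(4, 4, 500));
  }
}
```

In this sketch the timeout only exists so the demo terminates; in the scenario reported above there is no such timeout, which is why the queue size statistic stays maxed out. Routing the grant through a separate pool, as the last suggestion in the issue proposes for deadlock-prone messages, would let it be processed even when every waiter thread is blocked.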