[ https://issues.apache.org/jira/browse/GEODE-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kirk Lund updated GEODE-8357:
-----------------------------
    Description: 
The system property "DistributionManager.MAX_THREADS" default to 100:
{noformat}
int MAX_THREADS = Integer.getInteger("DistributionManager.MAX_THREADS", 100);
{noformat}
This system property was originally defined in geode-core ClusterDistributionManager 
and has since moved to geode-core OperationExecutors.
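
For anyone experimenting with this limit, the property has to be set on the server JVM (for example via gfsh with --J=-DDistributionManager.MAX_THREADS=500). The snippet below is only a sketch of the lookup semantics, mirroring the Integer.getInteger call above; MaxThreadsCheck is a hypothetical class, not part of Geode:
{noformat}
// Hypothetical check, not Geode code: prints the limit the JVM would resolve.
// Set the property on the server JVM, e.g.
//   gfsh> start server --name=server1 --J=-DDistributionManager.MAX_THREADS=500
public class MaxThreadsCheck {
  public static void main(String[] args) {
    // Same lookup as OperationExecutors: the system property wins, otherwise 100.
    int maxThreads = Integer.getInteger("DistributionManager.MAX_THREADS", 100);
    System.out.println("DistributionManager.MAX_THREADS resolves to " + maxThreads);
  }
}
{noformat}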

The value is used to limit both the ClusterOperationExecutors threadPool and 
highPriorityPool:
{noformat}
threadPool =
    CoreLoggingExecutors.newThreadPoolWithFeedStatistics("Pooled Message 
Processor ",
        thread -> stats.incProcessingThreadStarts(), this::doProcessingThread,
        MAX_THREADS, stats.getNormalPoolHelper(), threadMonitor,
        INCOMING_QUEUE_LIMIT, stats.getOverflowQueueHelper());

highPriorityPool = CoreLoggingExecutors.newThreadPoolWithFeedStatistics(
    "Pooled High Priority Message Processor ",
    thread -> stats.incHighPriorityThreadStarts(), this::doHighPriorityThread,
    MAX_THREADS, stats.getHighPriorityPoolHelper(), threadMonitor,
    INCOMING_QUEUE_LIMIT, stats.getHighPriorityQueueHelper());
{noformat}
I have seen server startup hang while recovering a large number of expired entries 
from disk with PDX in use. The hang looks like a dlock request for the PDX lock that 
never receives a response. Checking the distributionStats#highPriorityQueueSize 
statistic (in VSD) shows the value maxed out and never dropping.
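
The same statistic can be probed programmatically. This is only a sketch against the public Statistics API, assuming the type is registered under the name "DistributionStats" and that highPriorityQueueSize is readable through Statistics#get; HighPriorityQueueProbe is a hypothetical helper, not part of Geode:
{noformat}
import org.apache.geode.Statistics;
import org.apache.geode.StatisticsType;
import org.apache.geode.distributed.DistributedSystem;

// Hypothetical helper: dumps the same value that VSD charts as
// DistributionStats highPriorityQueueSize.
public class HighPriorityQueueProbe {
  public static void logHighPriorityQueueSize(DistributedSystem system) {
    StatisticsType type = system.findType("DistributionStats");
    if (type == null) {
      return; // statistics type not registered yet (or sampling disabled)
    }
    for (Statistics stats : system.findStatisticsByType(type)) {
      System.out.println(stats.getTextId() + " highPriorityQueueSize="
          + stats.get("highPriorityQueueSize"));
    }
  }
}
{noformat}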

The dlock response granting the PDX lock is stuck in the highPriorityQueue 
because no highPriorityPool threads are available to process it. Stack dumps of all 
the high priority threads show tasks, such as recovering a bucket from disk, blocked 
waiting for the PDX lock.
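
The pattern is easy to reproduce outside of Geode with any bounded pool. The class below is an illustration only (not Geode code), using a two-thread pool as a stand-in for MAX_THREADS: every worker blocks on a latch that represents the PDX lock, and the task that would release the latch sits in the queue behind them, so the program hangs:
{noformat}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration only: reproduces the starvation pattern with a tiny pool.
public class PoolStarvationDemo {
  public static void main(String[] args) {
    ExecutorService highPriorityPool = Executors.newFixedThreadPool(2);
    CountDownLatch pdxLockGranted = new CountDownLatch(1);

    // "Bucket recovery" tasks occupy every pool thread, each waiting on the lock.
    for (int i = 0; i < 2; i++) {
      highPriorityPool.submit(() -> {
        try {
          pdxLockGranted.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    // The "dlock response" that would release them never runs: it is queued
    // behind the blocked tasks, so the pool is deadlocked.
    highPriorityPool.submit(pdxLockGranted::countDown);
  }
}
{noformat}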

Several changes could improve this situation, either in combination or 
individually:
# improve observability so that support can identify when this situation has 
occurred
# automatically detect this situation and warn the user with a log statement (a 
minimal sketch of one possible check follows this list)
# automatically prevent this situation
# identify the messages that are prone to causing deadlocks and move them to a 
dedicated thread pool with a higher limit
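
The sketch below covers option 2 only, assuming the pool of interest is exposed as a java.util.concurrent.ThreadPoolExecutor; PoolSaturationWatcher is hypothetical, not existing Geode code. The idea: if every thread is busy while messages are still queued, log a warning.
{noformat}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical watcher for option 2: warn when a pool looks saturated.
public class PoolSaturationWatcher {
  public static void watch(ThreadPoolExecutor pool, String poolName) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      boolean saturated =
          pool.getActiveCount() >= pool.getMaximumPoolSize() && !pool.getQueue().isEmpty();
      if (saturated) {
        // A real implementation would use Geode's logger and rate-limit the warning.
        System.err.println("WARNING: " + poolName + " has all " + pool.getMaximumPoolSize()
            + " threads busy with " + pool.getQueue().size() + " messages queued;"
            + " possible distributed deadlock");
      }
    }, 10, 10, TimeUnit.SECONDS);
  }
}
{noformat}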


> Exhausting the high priority message pool can result in deadlock
> ----------------------------------------------------------------
>
>                 Key: GEODE-8357
>                 URL: https://issues.apache.org/jira/browse/GEODE-8357
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>    Affects Versions: 1.0.0-incubating, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 
> 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.11.0, 1.12.0
>            Reporter: Kirk Lund
>            Assignee: Kirk Lund
>            Priority: Major
>              Labels: GeodeOperationAPI
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
