[ https://issues.apache.org/jira/browse/GEODE-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528298#comment-16528298 ]
ASF subversion and git services commented on GEODE-5349:
--------------------------------------------------------

Commit dfafad79c2700a0356faf2eed1fe9ff4248e4ed4 in geode's branch refs/heads/develop from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=dfafad7 ]

GEODE-5349 State-flush operation may exit early allowing for cache inconsistency

Removed the ability for this method to exit without the operation count
falling to zero. Instead it issues a fatal-level log message, which
translates into a severe-level alert for operators. This can help tech
support know which server a customer should terminate in order to break
a distributed deadlock.

I also added an info-level message that is issued if a warning/fatal
message has been issued noting that the wait has completed. This
parallels what we do in ReplyProcessor21 if we've issued a warning that
a cache-op response hasn't been received within the ack-wait-threshold
period.

This closes #2083


> State-flush operation may terminate waiting for current operations, allowing for cache inconsistency
> ----------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-5349
>                 URL: https://issues.apache.org/jira/browse/GEODE-5349
>             Project: Geode
>          Issue Type: Bug
>          Components: regions
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The state-flush operation relies in part on
> DistributionAdvisor.waitForCurrentOperations() to stall until in-process
> replication efforts have written their messages to communication channels.
> This method currently has a self-imposed time limit of
> (2*ack-wait-threshold)-1 seconds, which defaults to 29 seconds. If a cache
> operation, say a transaction commit, happens to take longer than this the
> waitForCurrentOperations() method will terminate early, possibly allowing a
> new copy of a region to miss the changes contained in that cache operation.
> We should remove the timeout in waitForCurrentOperations and rigorously test
> the change.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
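
The behavior described in the commit message can be sketched roughly as
follows. This is an illustration only, not the code from the commit: the
class name, the threshold fields, the polling interval, and the in-memory
log list are all hypothetical stand-ins. The helper legacyTimeoutSeconds()
just restates the old limit from the issue text, (2*ack-wait-threshold)-1,
which gives 29 for a threshold of 15 seconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only; every name here is hypothetical, not Geode's API.
final class OperationDrainSketch {
  private final AtomicInteger inProgress = new AtomicInteger();
  private final long warnAfterMillis;  // assumed warning threshold
  private final long fatalAfterMillis; // assumed fatal threshold
  private final long pollMillis;       // assumed polling interval
  final List<String> log = new ArrayList<>(); // stand-in for a real logger

  OperationDrainSketch(long warnAfterMillis, long fatalAfterMillis, long pollMillis) {
    this.warnAfterMillis = warnAfterMillis;
    this.fatalAfterMillis = fatalAfterMillis;
    this.pollMillis = pollMillis;
  }

  // The old limit described in the issue: (2 * ack-wait-threshold) - 1 seconds.
  static int legacyTimeoutSeconds(int ackWaitThresholdSeconds) {
    return 2 * ackWaitThresholdSeconds - 1;
  }

  void startOperation() { inProgress.incrementAndGet(); }
  void endOperation() { inProgress.decrementAndGet(); }
  int operationCount() { return inProgress.get(); }

  // Returns only once the operation count reaches zero; there is no timeout exit.
  // Long waits are reported instead of abandoned.
  void waitForCurrentOperations() throws InterruptedException {
    long start = System.nanoTime();
    boolean warned = false;
    boolean fatal = false;
    while (inProgress.get() > 0) {
      long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
      if (!fatal && elapsedMillis >= fatalAfterMillis) {
        log.add("FATAL: still waiting for current operations after " + elapsedMillis + " ms");
        fatal = true;
      } else if (!warned && !fatal && elapsedMillis >= warnAfterMillis) {
        log.add("WARN: still waiting for current operations after " + elapsedMillis + " ms");
        warned = true;
      }
      Thread.sleep(pollMillis);
    }
    // Only announce completion if an earlier warning/fatal message was issued.
    if (warned || fatal) {
      log.add("INFO: wait for current operations completed");
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("legacy limit: " + legacyTimeoutSeconds(15) + " seconds");
    OperationDrainSketch sketch = new OperationDrainSketch(20, 40, 5);
    sketch.startOperation();
    Thread slowOp = new Thread(() -> {
      try {
        Thread.sleep(100); // simulate a cache operation that outlasts both thresholds
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      sketch.endOperation();
    });
    slowOp.start();
    sketch.waitForCurrentOperations();
    slowOp.join();
    sketch.log.forEach(System.out::println);
  }
}
```

The thresholds in main() are in milliseconds so the demo finishes quickly.
The point of the sketch is the shape of the loop: the only way out is the
count reaching zero, and the info-level message appears only after a
warning or fatal message has been logged.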