[ 
https://issues.apache.org/jira/browse/GEODE-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528298#comment-16528298
 ] 

ASF subversion and git services commented on GEODE-5349:
--------------------------------------------------------

Commit dfafad79c2700a0356faf2eed1fe9ff4248e4ed4 in geode's branch 
refs/heads/develop from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=dfafad7 ]

GEODE-5349 State-flush operation may exit early allowing for cache inconsistency

Removed the ability for DistributionAdvisor.waitForCurrentOperations() to
exit before the operation count falls to zero.  Instead of timing out, it
now issues a fatal-level log message, which translates into a severe-level
alert for operators.  This can help tech support know which server a
customer should terminate in order to break a distributed deadlock.

I also added an info-level message noting that the wait has completed.  It
is issued only if a warning or fatal message was issued earlier.  This
parallels what we do in ReplyProcessor21 when we've warned that a cache-op
response hasn't been received within the ack-wait-threshold period.
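
To illustrate the behavior described above, here is a minimal sketch of a
wait loop that never returns until the operation count reaches zero, that
escalates from a warning to a fatal-level message, and that logs an
info-level completion message only if it complained earlier.  This is not
the code from the commit: the class name, the polling interval, the two
threshold parameters and the in-memory log list are all assumptions made
for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the advisor's operation tracking; not Geode's actual code. */
public class StateFlushWaitSketch {
  private final AtomicInteger currentOperations = new AtomicInteger();
  // Stand-in for a real logger so the sketch stays self-contained.
  private final List<String> log = new ArrayList<>();

  public void startOperation() {
    currentOperations.incrementAndGet();
  }

  public void endOperation() {
    currentOperations.decrementAndGet();
  }

  public synchronized List<String> getLog() {
    return new ArrayList<>(log);
  }

  private synchronized void record(String message) {
    log.add(message);
  }

  /**
   * Blocks until no operations are in progress. There is deliberately no
   * timeout: passing a threshold only produces a log message.
   * Both thresholds are assumed parameters, in milliseconds.
   */
  public void waitForCurrentOperations(long warnMillis, long severeMillis)
      throws InterruptedException {
    long start = System.nanoTime();
    boolean warned = false;
    boolean severe = false;
    while (currentOperations.get() > 0) {
      long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
      if (!severe && elapsedMillis >= severeMillis) {
        record("FATAL: still waiting for current operations after " + severeMillis + " ms");
        severe = true;
      } else if (!warned && !severe && elapsedMillis >= warnMillis) {
        record("WARN: waiting for current operations for more than " + warnMillis + " ms");
        warned = true;
      }
      Thread.sleep(10); // assumed polling interval
    }
    // Tell operators the wait is over, but only if we complained earlier.
    if (warned || severe) {
      record("INFO: wait for current operations completed");
    }
  }
}
```

In this sketch a slow operation delays the caller for as long as it runs,
which is the trade-off the change accepts: the state flush may block, but
it does not proceed while an operation is still in flight.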

This closes #2083


> State-flush operation may terminate waiting for current operations, allowing 
> for cache inconsistency
> ----------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-5349
>                 URL: https://issues.apache.org/jira/browse/GEODE-5349
>             Project: Geode
>          Issue Type: Bug
>          Components: regions
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The state-flush operation relies in part on 
> DistributionAdvisor.waitForCurrentOperations() to stall until in-process 
> replication efforts have written their messages to communication channels.  
> This method currently has a self-imposed time limit of 
> (2*ack-wait-threshold)-1 seconds, which defaults to 29 seconds.  If a cache 
> operation, say a transaction commit, takes longer than this, 
> waitForCurrentOperations() will return early, possibly allowing a 
> new copy of a region to miss the changes contained in that cache operation.
> We should remove the timeout in waitForCurrentOperations() and rigorously 
> test the change.



