Bruce Schuchardt created GEODE-2865:
---------------------------------------
             Summary: data loss in initial-image replication with multicast
                 Key: GEODE-2865
                 URL: https://issues.apache.org/jira/browse/GEODE-2865
             Project: Geode
          Issue Type: Bug
          Components: messaging
            Reporter: Bruce Schuchardt


During initial image replication ("get initial image") a state-flush operation 
is performed to ensure that all in-flight operations are applied to the region 
being replicated prior to replication starting.  If multicast is enabled for a 
region it is currently possible for the state-flush to miss one or more 
in-flight operations, so that the new replicate is missing changes that are 
reflected in the region being replicated.

For example, process A sends a multicast put() replication message to process 
B.  Simultaneously process C is replicating the affected region and performs a 
state-flush.  Process A sends a state-stabilization message to process B noting 
its multicast channel state (NAKACK2 outgoing message counter).  Process B 
receives this and waits for the multicast channel state to show that it has 
received all of the messages.  Process B then sends a state-stabilized message 
to process C (the new replicate).

The state-stabilization algorithm in this case is faulty because it is 
performed in the waiting-thread pool.  The algorithm assumes that it is 
executing in the serial-executor thread pool so that any messages that happened 
before it have been applied to the region.  As a result, messages can be 
received and scheduled for the serial-executor but not yet applied to the 
region when replication begins.
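
The race described above can be illustrated with a small, self-contained 
sketch.  It does not use Geode or JGroups code: the class name, the counters 
and the latch are all hypothetical stand-ins.  A single-threaded executor plays 
the role of the serial-executor, "received" stands in for the multicast channel 
state, and "applied" counts operations that have actually reached the region.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; none of these names come from Geode.
class StateFlushRaceSketch {

  // Returns {received, applied} as seen at the moment the counter check passes.
  static long[] observeAtCounterCheck() throws InterruptedException {
    ExecutorService serialExecutor = Executors.newSingleThreadExecutor();
    AtomicLong received = new AtomicLong(); // stand-in for the multicast receive counter
    AtomicLong applied = new AtomicLong();  // operations actually applied to the region
    CountDownLatch hold = new CountDownLatch(1);
    try {
      // Keep the serial executor busy so that later work stays queued.
      serialExecutor.execute(() -> {
        try {
          hold.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });

      // A put() message arrives: it is counted as received and then
      // scheduled for the serial executor, but it has not run yet.
      received.incrementAndGet();
      serialExecutor.execute(() -> applied.incrementAndGet());

      // The stabilization check runs on another thread and looks only
      // at the receive counter, so it passes.
      long senderCounter = 1; // assumed value reported by the sender
      boolean counterCheckPasses = received.get() >= senderCounter;
      long[] snapshot = {received.get(), applied.get()};
      if (!counterCheckPasses) {
        throw new IllegalStateException("counter check unexpectedly failed");
      }

      hold.countDown(); // let the queued work drain before shutting down
      return snapshot;
    } finally {
      serialExecutor.shutdown();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    long[] snapshot = observeAtCounterCheck();
    System.out.println("received=" + snapshot[0] + " applied=" + snapshot[1]);
  }
}
```

In this sketch the counter check succeeds while the operation is still waiting 
in the executor queue, which is the window in which replication could begin 
without the change.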

The membership manager should be modified to ensure that the serial-executor 
queue has been flushed before giving the state-flush operation the go-ahead to 
begin replication.
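
One common way to flush a single-threaded executor queue is to submit a no-op 
marker task and wait for it to complete.  The sketch below shows that general 
technique with plain java.util.concurrent classes.  It is not the membership 
manager's actual code, and the class name, method names and timeout are 
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names come from Geode.
class SerialExecutorFlushSketch {

  // Blocks until every task submitted to the single-threaded executor
  // before this call has finished.  Tasks run in submission order, so
  // the marker completes only after everything queued ahead of it.
  static void flushSerialExecutor(ExecutorService serialExecutor, long timeoutMillis)
      throws InterruptedException, ExecutionException, TimeoutException {
    serialExecutor.submit(() -> { }).get(timeoutMillis, TimeUnit.MILLISECONDS);
  }

  // Queues three slow "apply to region" tasks, flushes, and returns what
  // has been applied by the time the flush returns.
  static List<String> applyThenFlush() throws Exception {
    ExecutorService serialExecutor = Executors.newSingleThreadExecutor();
    List<String> region = Collections.synchronizedList(new ArrayList<>());
    try {
      for (int i = 1; i <= 3; i++) {
        final String op = "put-" + i;
        serialExecutor.execute(() -> {
          try {
            Thread.sleep(20); // simulate work done while applying the operation
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
          region.add(op);
        });
      }
      // Only after the flush returns is it safe to give the go-ahead.
      flushSerialExecutor(serialExecutor, 5_000);
      return new ArrayList<>(region);
    } finally {
      serialExecutor.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(applyThenFlush());
  }
}
```

The marker approach relies on the executor having exactly one worker thread and 
a FIFO queue.  With more than one worker the marker could finish ahead of 
earlier tasks, so a different barrier would be needed.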



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
