Bruce Schuchardt created GEODE-2865:
---------------------------------------
Summary: data loss in initial-image replication with multicast
Key: GEODE-2865
URL: https://issues.apache.org/jira/browse/GEODE-2865
Project: Geode
Issue Type: Bug
Components: messaging
Reporter: Bruce Schuchardt
During initial image replication ("get initial image") a state-flush operation
is performed to ensure that all in-flight operations are applied to the region
being replicated prior to replication starting. If multicast is enabled for a
region, the state-flush can currently miss one or more in-flight operations,
leaving the new replicate without changes that are present in the region
being replicated.
For example, process A sends a multicast put() replication message to process
B. Simultaneously, process C is replicating the affected region and performs a
state-flush. Process A sends a state-stabilization message to process B noting
its multicast channel state (NAKACK2 outgoing message counter). Process B
receives this and waits until its own multicast channel state shows that it
has received all of the messages up to that counter. Process B then sends a
state-stabilized message to process C (the new replicate).
The state-stabilization algorithm is faulty in this case because it runs in
the waiting-thread pool. The algorithm assumes that it is executing in the
serial-executor thread pool, where any messages that arrived before it would
already have been applied to the region. As a result, messages can be
received and scheduled on the serial-executor but not yet applied to the
region when replication begins.
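The race can be illustrated with a minimal, self-contained Java sketch. It
does not use Geode or JGroups classes: the names StateFlushRaceSketch,
receivedCount, serialExecutor, waitingPool and holdApply are hypothetical
stand-ins, a plain map plays the region, and an AtomicLong plays the
multicast receive counter. A latch holds the serial executor busy so the
outcome is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class StateFlushRaceSketch {

  /** Returns true if the counter check passed while the put was still unapplied. */
  static boolean counterCheckPassesBeforeApply() throws Exception {
    // Hypothetical stand-ins for process B's thread pools and region.
    ExecutorService serialExecutor = Executors.newSingleThreadExecutor();
    ExecutorService waitingPool = Executors.newCachedThreadPool();
    Map<String, String> region = new ConcurrentHashMap<>();
    AtomicLong receivedCount = new AtomicLong();
    CountDownLatch holdApply = new CountDownLatch(1);
    try {
      // B receives multicast message #1 from A: it is counted on receipt,
      // then queued for the serial executor to apply to the region.
      receivedCount.incrementAndGet();
      serialExecutor.execute(() -> {
        try {
          holdApply.await(); // simulates a serial executor that is still busy
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        region.put("key", "value");
      });

      // Counter value carried by A's state-stabilization message (assumed).
      long senderCount = 1;

      // The check runs in the waiting pool, so it only sees the receive
      // counter and says nothing about what has been applied to the region.
      Future<Boolean> check = waitingPool.submit(
          () -> receivedCount.get() >= senderCount && !region.containsKey("key"));
      return check.get(5, TimeUnit.SECONDS);
    } finally {
      holdApply.countDown(); // let the queued put complete
      serialExecutor.shutdown();
      waitingPool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("stabilized with put unapplied: "
        + counterCheckPassesBeforeApply());
  }
}
```

In this sketch the check succeeds even though the put is still sitting in
the serial executor's queue, which is the window in which a state-stabilized
message could be sent too early.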
The membership manager should be modified to ensure that the serial-executor
queue has been flushed before giving the state-flush operation the go-ahead to
begin replication.
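One way to flush a serial queue, sketched here under assumptions, is to
enqueue a marker task and wait for it to run. This is not the membership
manager's actual code: SerialQueueFlushSketch and flushSerialQueue are
hypothetical names, and the approach relies on the executor being
single-threaded with a FIFO queue, as
java.util.concurrent.Executors.newSingleThreadExecutor() provides.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SerialQueueFlushSketch {

  /**
   * Blocks until every task queued on the single-threaded executor before
   * this call has finished, or until the timeout expires.
   * Returns true if the queue was flushed in time.
   */
  static boolean flushSerialQueue(ExecutorService serialExecutor,
                                  long timeout, TimeUnit unit)
      throws InterruptedException {
    CountDownLatch marker = new CountDownLatch(1);
    // The marker runs only after all previously queued tasks, because the
    // executor has one thread and processes its queue in FIFO order.
    serialExecutor.execute(marker::countDown);
    return marker.await(timeout, unit);
  }

  public static void main(String[] args) throws Exception {
    ExecutorService serialExecutor = Executors.newSingleThreadExecutor();
    List<String> region = new CopyOnWriteArrayList<>(); // stand-in for the region

    // A put that has been received and scheduled but not yet applied.
    serialExecutor.execute(() -> {
      try {
        Thread.sleep(100); // simulates slow application of the operation
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      region.add("put(k1)");
    });

    // Only after the flush returns true would it be safe, in this sketch,
    // to send the state-stabilized message.
    boolean flushed = flushSerialQueue(serialExecutor, 5, TimeUnit.SECONDS);
    System.out.println("flushed=" + flushed + " region=" + region);
    serialExecutor.shutdown();
  }
}
```

The marker approach avoids polling the queue size, which would not account
for a task that has been dequeued but is still executing.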