Bruce Schuchardt created GEODE-2865:
---------------------------------------

Summary: data loss in initial-image replication with multicast
Key: GEODE-2865
URL: https://issues.apache.org/jira/browse/GEODE-2865
Project: Geode
Issue Type: Bug
Components: messaging
Reporter: Bruce Schuchardt

During initial image replication ("get initial image") a state-flush operation is performed to ensure that all in-flight operations are applied to the region being replicated before replication starts. If multicast is enabled for a region, it is currently possible for the state-flush to miss one or more in-flight operations, so that the new replicate is missing changes that are reflected in the region being replicated.

For example, process A sends a multicast put() replication message to process B. Simultaneously, process C is replicating the affected region and performs a state-flush. Process A sends a state-stabilization message to process B noting its multicast channel state (NAKACK2 outgoing message counter). Process B receives this and waits for the multicast channel state to show that it has received all of the messages. Process B then sends a state-stabilized message to process C (the new replicate).

The state-stabilization algorithm in this case is faulty because it is performed in the waiting-thread pool. The algorithm assumes that it is executing in the serial-executor thread pool, so that any messages that happened before it have already been applied to the region. As a result, messages can be received and scheduled for the serial-executor but not yet applied to the region when replication begins.

The membership manager should be modified to ensure that the serial-executor queue has been flushed before giving the state-flush operation the go-ahead to begin replication.
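
The proposed fix can be illustrated with a minimal, self-contained Java sketch. This is not Geode's membership manager code: the class SerialQueueFlushSketch and the methods onPutReceived, flushSerialQueue and demo are hypothetical names invented for illustration, and a plain single-thread ExecutorService stands in for the serial-executor. The idea shown is that a marker task submitted to the serial queue only runs after everything queued before it, so waiting on that marker guarantees that earlier operations have been applied before the state-stabilized reply is sent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names are Geode APIs.
class SerialQueueFlushSketch {
  // Stand-in for the region being replicated.
  final Map<String, String> region = new ConcurrentHashMap<>();
  // Stand-in for the serial-executor: one thread, tasks run in submission order.
  final ExecutorService serialExecutor = Executors.newSingleThreadExecutor();

  // A received put() is only scheduled here; it is applied later by the serial thread.
  // The gate lets the demo hold the operation in the queue to mimic a backlog.
  void onPutReceived(String key, String value, CountDownLatch gate) {
    serialExecutor.execute(() -> {
      try {
        gate.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      region.put(key, value);
    });
  }

  // Submits a marker task and waits for it. Because the executor is serial,
  // the marker runs only after every task queued before this call has run.
  boolean flushSerialQueue(long timeout, TimeUnit unit) throws InterruptedException {
    CountDownLatch drained = new CountDownLatch(1);
    serialExecutor.execute(drained::countDown);
    return drained.await(timeout, unit);
  }

  // Returns {applied before flush, applied after flush}.
  static boolean[] demo() {
    SerialQueueFlushSketch b = new SerialQueueFlushSketch();
    try {
      CountDownLatch gate = new CountDownLatch(1);
      b.onPutReceived("key", "value", gate);

      // Faulty check: the message has been received and scheduled,
      // but a thread outside the serial executor cannot assume it was applied.
      boolean appliedBeforeFlush = b.region.containsKey("key");

      gate.countDown(); // let the queued operation proceed
      boolean flushed = b.flushSerialQueue(5, TimeUnit.SECONDS);
      boolean appliedAfterFlush = flushed && b.region.containsKey("key");
      return new boolean[] {appliedBeforeFlush, appliedAfterFlush};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      b.serialExecutor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    boolean[] result = demo();
    System.out.println("applied before flush: " + result[0]);
    System.out.println("applied after flush: " + result[1]);
  }
}
```

In this sketch the check made before the flush sees the put() as scheduled but not applied, which corresponds to the window described above. Only after flushSerialQueue returns would it be safe to send the state-stabilized message.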