[ https://issues.apache.org/jira/browse/GEODE-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339313#comment-17339313 ]
Xiaojian Zhou edited comment on GEODE-9191 at 5/27/21, 6:30 PM:
----------------------------------------------------------------

More investigation found that primary buckets can switch at any time, especially when they are not balanced (which usually happens during GII). We need to lock the primaries so they cannot move. The revised design is:

(1) The coordinator (a server) calls assignAllBuckets.

(2) The coordinator locks its local primary buckets and sends a lock message to all peer members.

(3) Upon receiving the lock message, each datastore server will (see the member-side lock sketch below):
  - lockBucketCreationForRegionClear (maybe we don't need it)
  - wait for all the primaries to show up
  - iterate through its local primary bucket list, locking each primary against moving and then locking its RVV
  - reply with the number of buckets it locked

(4) The coordinator collects the locked-bucket counts from each member. If the sum matches the expected total bucket count (the default is 113), it moves on to the next step; otherwise it retries (see the coordinator lock-phase sketch below).
  - While a member iterates through its local primary bucket list, some of those buckets may no longer be primary, so the sum of locked-primary counts collected by the coordinator can differ from the expected total bucket count. In that case the coordinator unlocks all the members and retries.
  - Retry until it succeeds or fails with PartialClearException. The retry will usually succeed unless too many servers have shut down; in that case waitForPrimary will fail with PartialClearException and break the endless retry.

(5) If a member goes down, the membership listener will detect it and make the coordinator retry (see the membership-listener sketch below). If too many members are down, waitForPrimary should fail with PartitionedRegionPartialClearException; the coordinator will then unlock and throw this exception to the caller. If the coordinator goes down, the PR clear will fail.
Note: the membership listener triggers the coordinator to retry from the beginning rather than having each member retry locally, because the primary list might have changed.

(6) After all the members' primary buckets are locked (both the primary and its RVV), the coordinator sends the clear message to all the members.

(7) Each member clears its primary buckets one by one and returns the number of buckets cleared.

(8) The coordinator collects all the cleared counts; if the sum is less than the expected bucket count, it retries (see the clear-phase sketch below). This can happen when a member goes offline in the middle of the clear. The retry should eventually succeed unless too many servers are down, in which case waitForPrimary will throw PartialClearException.

(9) In unlock, the coordinator should send the UNLOCK message to all the members to unlock all buckets, not only the primary buckets, because the primary list at each member could have changed (see the unlock sketch below).
  - Each member should iterate through all of its local buckets, unlock the RVV, then unlock primary moving.
  - getLockRequester() == null means the coordinator is down; in that case we should still do the unlock.
  - Since the PR clear retries until it succeeds or fails with PartialClearException because too many members are down, there is no need for lockBucketCreationForRegionClear.
  - If a member goes down, the listener is triggered on all the members, and they check which member went down. If the down member is not the coordinator, the coordinator notifies all the other members to unlock and clean up. If it is the coordinator that shut down, the other members should unlock and clean up.
  - If the coordinator is down, rely on the listener to let each member unlock its own local buckets (unlocking an already-unlocked bucket is ok).
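A minimal sketch of the member-side lock handler in step (3). BucketHandle, lockPrimaryMove, and lockRvv are invented stand-ins for the real BucketRegion calls, and the wait for all primaries to show up is omitted; this is not the actual Geode code.

{code:java}
import java.util.List;

/**
 * Hypothetical member-side handler for the coordinator's LOCK message (step 3).
 * BucketHandle and its methods are invented stand-ins for BucketRegion; the
 * "wait for all the primaries to show up" step is left out of the sketch.
 */
public class LockForClearHandler {

  /** Invented stand-in for a hosted bucket; not a real Geode type. */
  interface BucketHandle {
    boolean isPrimary();
    void lockPrimaryMove(); // pin the primary so it cannot move away
    void lockRvv();         // block new operations from advancing the RVV
  }

  /**
   * Locks every bucket this member currently hosts as primary and returns the
   * count. The coordinator sums these replies and compares the total against
   * the expected bucket count.
   */
  public int handleLockMessage(List<BucketHandle> localBuckets) {
    int locked = 0;
    for (BucketHandle bucket : localBuckets) {
      if (bucket.isPrimary()) {
        // Order matters: first pin the primary, then lock the RVV so no new
        // writes slip in while the clear is running.
        bucket.lockPrimaryMove();
        bucket.lockRvv();
        locked++;
      }
    }
    return locked;
  }
}
{code}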
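A rough sketch of the coordinator's lock phase and retry loop from steps (2) and (4). MemberChannel, sendLockMessage, sendUnlockMessage, and waitForAllPrimaries are assumptions made for the sketch, not Geode APIs.

{code:java}
import java.util.List;

/**
 * Hypothetical coordinator-side lock phase (steps 2 and 4). MemberChannel is an
 * invented messaging abstraction, one per datastore member.
 */
public class ClearLockCoordinator {

  interface MemberChannel {
    int sendLockMessage();    // member locks its primaries, replies with the count
    void sendUnlockMessage(); // member releases whatever it locked
  }

  /** Stand-in for the partial-clear exception that ends the retry loop. */
  static class PartialClearException extends RuntimeException {}

  private final int totalBuckets; // expected total, e.g. 113 by default

  public ClearLockCoordinator(int totalBuckets) {
    this.totalBuckets = totalBuckets;
  }

  /**
   * Retries until every one of the totalBuckets primaries is locked somewhere.
   * A shortfall means a primary moved while members were locking, so unlock
   * everything and start over.
   */
  public void lockAllPrimaries(List<MemberChannel> members) {
    while (true) {
      waitForAllPrimaries(); // may throw PartialClearException and end the loop
      int locked = 0;
      for (MemberChannel member : members) {
        locked += member.sendLockMessage();
      }
      if (locked == totalBuckets) {
        return; // every primary is pinned and RVV-locked
      }
      // Some primaries moved mid-lock; release all locks and retry from scratch.
      for (MemberChannel member : members) {
        member.sendUnlockMessage();
      }
    }
  }

  private void waitForAllPrimaries() {
    // Placeholder for assignAllBuckets() plus waiting for every bucket to have a
    // primary; expected to throw PartialClearException when too many members are
    // down, which is what breaks the otherwise endless retry.
  }
}
{code}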
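A sketch of the membership-listener behavior described in steps (5) and (9): if a datastore departs, the coordinator restarts the protocol; if the coordinator itself departs, every member unlocks its own buckets. ClearSession and its methods are invented for the sketch.

{code:java}
/**
 * Hypothetical membership-listener reaction during a PR clear (steps 5 and 9).
 * ClearSession is an invented abstraction over the in-flight clear.
 */
public class ClearMembershipListener {

  interface ClearSession {
    String getCoordinatorId();
    boolean isLocalMemberCoordinator();
    void unlockLocalBuckets();      // unlock RVV, then unlock primary moving
    void retryClearFromBeginning(); // coordinator only: redo lock + clear phases
  }

  private final ClearSession session;

  public ClearMembershipListener(ClearSession session) {
    this.session = session;
  }

  /** Called when a member leaves the distributed system mid-clear. */
  public void memberDeparted(String departedMemberId) {
    if (departedMemberId.equals(session.getCoordinatorId())) {
      // The coordinator is gone, so no UNLOCK message will ever arrive; each
      // member releases its own locks (unlocking an unlocked bucket is a no-op).
      session.unlockLocalBuckets();
    } else if (session.isLocalMemberCoordinator()) {
      // A datastore departed, so its primaries will move and the primary list
      // is stale; restart the whole protocol rather than patching it locally.
      session.retryClearFromBeginning();
    }
  }
}
{code}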
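A sketch of the coordinator's clear phase from steps (6) to (8). The member-side counting of cleared buckets sits behind sendClearMessage, and relockPrimaries stands in for re-running the lock phase before a retry; both are assumptions, not Geode APIs.

{code:java}
import java.util.List;

/**
 * Hypothetical coordinator-side clear phase (steps 6-8): send CLEAR to every
 * member, sum the per-member cleared-bucket counts, and retry if the total
 * falls short (e.g. a member went offline in the middle of the clear).
 */
public class ClearPhase {

  interface MemberChannel {
    int sendClearMessage(); // member clears its locked primaries, replies with the count
  }

  private final int totalBuckets;

  public ClearPhase(int totalBuckets) {
    this.totalBuckets = totalBuckets;
  }

  /**
   * Returns once every bucket reports cleared. relockPrimaries() is expected to
   * throw a partial-clear exception when too many members are down, which is
   * what keeps this from retrying forever.
   */
  public void clearAllBuckets(List<MemberChannel> members) {
    while (true) {
      int cleared = 0;
      for (MemberChannel member : members) {
        cleared += member.sendClearMessage();
      }
      if (cleared >= totalBuckets) {
        return;
      }
      // Fewer buckets cleared than expected: some primaries moved or a member
      // died mid-clear. Re-run the lock phase, then clear again.
      relockPrimaries();
    }
  }

  private void relockPrimaries() {
    // Placeholder for repeating the lock phase before the next clear attempt.
  }
}
{code}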
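A sketch of the member-side UNLOCK handling in step (9). Because the primary list may have changed since the lock phase, the member walks all of its local buckets, not only the ones it locked. BucketHandle and its unlock methods are invented stand-ins.

{code:java}
import java.util.List;

/**
 * Hypothetical member-side UNLOCK handling (step 9). This runs both when the
 * coordinator sends UNLOCK and when the coordinator has died
 * (getLockRequester() == null in the real code); unlocking a bucket that was
 * never locked is assumed to be safe.
 */
public class UnlockAfterClear {

  /** Invented stand-in for a hosted bucket; not a real Geode type. */
  interface BucketHandle {
    void unlockRvv();         // release the RVV lock, a no-op if never taken
    void unlockPrimaryMove(); // allow the primary to move again
  }

  /** Walks ALL local buckets, since the primary list may have changed. */
  public void unlockAllLocalBuckets(List<BucketHandle> localBuckets) {
    for (BucketHandle bucket : localBuckets) {
      // Reverse of the lock order: release the RVV first, then let the primary move.
      bucket.unlockRvv();
      bucket.unlockPrimaryMove();
    }
  }
}
{code}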
Test cases to be added in another geode ticket:
(1) A secondary becomes the new primary, which is unlocked: how does an on-going operation sync with the PR clear?
(2) A new operation creates a new bucket, which is also a primary: how does it sync with the PR clear?
(3) Fixed partitioning: how does it sync with assignAllBuckets and the PR clear?
(4) If 2 servers are shut down with redundancy=1, assignAllBuckets() should be called to recreate the buckets on the existing servers, unless a PartitionOfflineException is thrown.
> PR clear could miss clearing bucket which lost primary
> ------------------------------------------------------
>
>                 Key: GEODE-9191
>                 URL: https://issues.apache.org/jira/browse/GEODE-9191
>             Project: Geode
>          Issue Type: Sub-task
>            Reporter: Xiaojian Zhou
>            Assignee: Xiaojian Zhou
>            Priority: Major
>              Labels: GeodeOperationAPI, pull-request-available
>
> This scenario was found when introducing a GII test case for PR clear. The sequence is:
> (1) There are 3 servers: server1 is an accessor, server2 and server3 are datastores.
> (2) Shut down server2.
> (3) Send a PR clear from server1 (the accessor) and restart server2 at the same time. There is a race in which server2 does not receive the PartitionedRegionClearMessage.
> (4) server2 finishes GII.
> (5) Only server3 received the PartitionedRegionClearMessage, and it hosts all the primary buckets. While the PR clear thread iterates through these primary buckets one by one, some of them might lose primary to server2.
> (6) BR.cmnClearRegion returns immediately since the bucket is no longer primary, but clearedBuckets.add(localPrimaryBucketRegion.getId()); is still called. So from the caller's point of view, this bucket is cleared. It doesn't even throw PartitionedRegionPartialClearException.
> The problem is: before calling cmnClearRegion, we should call BR.doLockForPrimary to make sure the bucket is still primary. If not, throw an exception; then clearedBuckets.add(localPrimaryBucketRegion.getId()); will not be called for this bucket.
> The expected behavior in this scenario is to throw PartitionedRegionPartialClearException.
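A minimal sketch of the fix described above: re-check (and hold) primary status before each bucket clear, and only count the bucket as cleared when the clear actually ran on the primary. The Bucket interface, doLockForPrimary's boolean return, and releasePrimaryLock are stand-ins invented for the sketch; the real BucketRegion signatures may differ.

{code:java}
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the per-bucket primary check before clearing. Bucket is an invented
 * stand-in for BucketRegion, and the nested exception is a local stand-in for
 * Geode's PartitionedRegionPartialClearException.
 */
public class ClearLocalPrimaryBuckets {

  interface Bucket {
    int getId();
    boolean doLockForPrimary(); // true only if this member still is, and stays, primary
    void cmnClearRegion();
    void releasePrimaryLock();
  }

  static class PartitionedRegionPartialClearException extends RuntimeException {}

  /** Returns the ids of buckets that were genuinely cleared while primary. */
  public List<Integer> clearPrimaryBuckets(List<Bucket> localPrimaryBuckets) {
    List<Integer> clearedBuckets = new ArrayList<>();
    for (Bucket bucket : localPrimaryBuckets) {
      if (!bucket.doLockForPrimary()) {
        // Primary moved away since the bucket list was built: do not count the
        // bucket as cleared; surface a partial clear instead.
        throw new PartitionedRegionPartialClearException();
      }
      try {
        bucket.cmnClearRegion();
        clearedBuckets.add(bucket.getId()); // only after a real primary clear
      } finally {
        bucket.releasePrimaryLock();
      }
    }
    return clearedBuckets;
  }
}
{code}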