[ https://issues.apache.org/jira/browse/GEODE-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339313#comment-17339313 ]

Xiaojian Zhou edited comment on GEODE-9191 at 5/27/21, 6:30 PM:
----------------------------------------------------------------

Further investigation found that primary buckets can switch at any time, 
especially when they are not balanced (which usually happens during GII). We 
need to lock the primaries to prevent them from moving.

The revised design will be:

(1) The coordinator (a server) calls assignAllBuckets. 

(2) The coordinator locks its local primary buckets and sends a lock message to 
all peer members. 
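
A minimal sketch of the coordinator side of steps (1) and (2); only 
assignAllBuckets is named in this design, the other helper names are 
placeholders:

{code:java}
// Coordinator-side kickoff of the lock phase (steps 1 and 2). Names other than
// assignAllBuckets are illustrative, not the actual Geode internals.
void startLockPhase(PartitionedRegion pr) {
  assignAllBuckets(pr);              // (1) make sure every bucket has a primary first
  lockLocalPrimaryBuckets(pr);       // (2a) lock the primaries hosted on this member
  sendLockMessageToAllPeers(pr);     // (2b) ask every peer datastore to do the same;
                                     //      the replies are counted in step (4)
}
{code}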

(3) Upon receiving the lock message, each datastore server will (see the sketch 
after this list):
- lockBucketCreationForRegionClear (maybe we don't need it)
- wait for all the primaries to show up
- iterate through the local primary bucket list to lock each primary against 
moving, then lock its RVV
- reply with the number of buckets locked 
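
A rough sketch of the member-side handling in step (3); apart from 
lockBucketCreationForRegionClear, the helper names here are just illustrative:

{code:java}
// Member-side handling of the lock message (step 3). Helper names are hypothetical.
int handleLockMessage(PartitionedRegion pr) {
  lockBucketCreationForRegionClear(pr);   // possibly unnecessary, see note above
  waitForPrimaries(pr);                   // wait until all the primaries have shown up

  int lockedCount = 0;
  for (BucketRegion bucket : getLocalPrimaryBuckets(pr)) {  // hypothetical accessor
    lockPrimaryMoving(bucket);            // keep this primary from moving away
    lockRvvForClear(bucket);              // then lock the bucket's RVV
    lockedCount++;
  }
  // The reply carries how many buckets this member locked, so the coordinator can
  // compare the sum against the total bucket count in step (4).
  return lockedCount;
}
{code}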

(4) The coordinator collects the locked-bucket counts from each member; if the 
sum matches the expected total number of buckets (the default is 113), it moves 
on to the next step. Otherwise it retries (see the sketch after this list).
- It is possible that, while a member iterates through its local primary bucket 
list, some of those buckets are no longer primary. In that case, the sum of 
locked primary buckets collected by the coordinator can differ from the expected 
total bucket count. The coordinator then unlocks all the members and retries. 
- Retry until it succeeds or fails with PartialClearException. The retry will 
usually succeed unless too many servers have shut down; in that case, 
waitForPrimary will fail with PartialClearException and break the otherwise 
endless retry. 
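
The coordinator-side check and retry in step (4) could look roughly like this; 
the helper names are assumptions about how the messaging is wrapped:

{code:java}
// Coordinator: compare the locked-bucket total against the expected bucket count
// and retry until they match (step 4). Helper names are illustrative.
void lockAllPrimaryBuckets(PartitionedRegion pr) {
  int expected = pr.getTotalNumberOfBuckets();         // e.g. 113 by default
  while (true) {
    int locked = sendLockMessageAndCollectCounts(pr);  // sum of the step (3) replies
    if (locked == expected) {
      return;                                          // every primary is locked, go clear
    }
    // Some primaries moved while members were locking; undo everything and retry.
    sendUnlockToAllMembers(pr);
    // waitForPrimaries() is expected to throw PartitionedRegionPartialClearException
    // when too many members are gone, which is what breaks this otherwise endless retry.
    waitForPrimaries(pr);
  }
}
{code}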

(5) If a member goes down, the membership listener will detect it and make the 
coordinator retry. If too many members are down, wait-for-primary should fail 
with PartitionedRegionPartialClearException; the coordinator will then unlock 
and throw this exception to the caller. If the coordinator itself goes down, the 
PR clear will fail. 
Note: the membership listener triggers the coordinator to retry from the 
beginning, rather than letting each member retry locally, because the primary 
list might have changed. 
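
A sketch of how the step (5) departure handling could behave; in the real code 
this would live in a membership listener, and all the helper names below are 
assumptions:

{code:java}
// Reaction to a member departure during PR clear (step 5). Every name here is
// illustrative; the point is who retries and who unlocks.
void onMemberDeparted(InternalDistributedMember departed) {
  if (isClearCoordinator(departed)) {
    // The coordinator is gone: each member unlocks its own local buckets (see step 9).
    unlockLocalBuckets();
  } else if (amIClearCoordinator()) {
    // A non-coordinator member left: the coordinator retries the whole sequence from
    // step (1), because the primary list may have changed. Members never retry locally.
    retryClearFromBeginning();
  }
}
{code}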

(6) After all members' primary buckets are locked (both the primary itself and 
its RVV), the coordinator sends a clear message to all the members.

(7) Each member clears its primary buckets one by one and returns the number of 
buckets cleared.
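
A sketch of the member-side clear in step (7), folding in the primary re-check 
described in the issue below; only doLockForPrimary and cmnClearRegion come from 
the existing code, the rest of the names (and the exact signatures) are 
assumptions:

{code:java}
// Member-side clear (step 7): clear local primary buckets one by one and report how
// many were actually cleared while still primary.
int clearLocalPrimaryBuckets(PartitionedRegion pr, RegionEventImpl regionEvent) {
  int cleared = 0;
  for (BucketRegion bucket : getLocalPrimaryBuckets(pr)) {   // hypothetical accessor
    // Re-check primariness before clearing; without this, a bucket that just lost
    // primary would still be counted as cleared even though nothing was cleared.
    if (!bucket.doLockForPrimary(false)) {                   // signature may differ
      continue;   // lost primary; the short count makes the coordinator retry (step 8)
    }
    try {
      bucket.cmnClearRegion(regionEvent, true, true);        // signature may differ
      cleared++;
    } finally {
      bucket.doUnlockForPrimary();                           // assumed counterpart
    }
  }
  return cleared;
}
{code}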

(8) The coordinator collects all the cleared counts; if the sum is less than the 
expected bucket number, it retries. This can happen when a member goes offline 
in the middle of the clear. The retry should eventually succeed unless too many 
servers are down, in which case waitForPrimary will throw PartialClearException. 
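
Steps (6) and (8) on the coordinator can then mirror the lock phase; again the 
helper names are only illustrative:

{code:java}
// Coordinator: send the clear message and verify the cleared-bucket total (steps 6 and 8).
void clearAllBuckets(PartitionedRegion pr) {
  int expected = pr.getTotalNumberOfBuckets();
  while (true) {
    int cleared = sendClearMessageAndCollectCounts(pr);  // sum of per-member cleared counts
    if (cleared == expected) {
      return;                                            // every bucket was cleared while primary
    }
    // A member went offline (or primaries moved) in the middle of the clear; wait for
    // the primaries to settle and retry. waitForPrimaries() throws
    // PartitionedRegionPartialClearException when too many servers are down.
    waitForPrimaries(pr);
  }
}
{code}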

(9) For unlock, the coordinator should send an UNLOCK message to all the members 
to unlock all buckets, not only the primary buckets, because the primary list at 
each member could have changed (see the sketch after this list).
- Each member should iterate through all its local buckets and unlock the RVV, 
then unlock primary moving. 
- getLockRequester() == null means the coordinator is down. In that case, we 
should still do the unlock. 
- Since the PR clear retries until it succeeds or fails with 
PartialClearException because too many members are down, there is no need for 
lockBucketCreationForRegionClear.
- If a member goes down, the listener is triggered on all the members, and they 
check who went down. If the down member is not the coordinator, the coordinator 
notifies all the other members to unlock and clean up. If it is the coordinator 
that shut down, the other members should unlock and clean up. 
- If the coordinator is down: rely on the listener to let each member unlock its 
own local buckets (unlocking an already-unlocked bucket is ok).  
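
A sketch of the member-side unlock in step (9); note that it walks all local 
buckets, not just the primaries, and the helper names are assumptions:

{code:java}
// Member-side unlock (step 9): iterate over ALL local buckets, because the primary
// list may have changed since the lock phase. Unlocking an already-unlocked bucket
// is harmless, which also covers the coordinator-down case where the membership
// listener (getLockRequester() == null) calls this locally.
void handleUnlockMessage(PartitionedRegion pr) {
  for (BucketRegion bucket : getAllLocalBuckets(pr)) {   // hypothetical accessor
    unlockRvvForClear(bucket);        // release the RVV lock first...
    unlockPrimaryMoving(bucket);      // ...then allow the primary to move again
  }
}
{code}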

Test cases to be added in another geode ticket:
(1) A secondary becomes the new primary, which is unlocked; how does the 
on-going operation sync with PR clear?
(2) A new operation creates a new bucket which is also a primary; how does it 
sync with PR clear?
(3) Fixed partitioning: how does it sync with assignAllBuckets and PR clear?
(4) If 2 servers are shut down with redundancy=1, assignAllBuckets() should be 
called to recreate the buckets on the existing servers, unless 
PartitionOfflineException is thrown.





> PR clear could miss clearing bucket which lost primary
> ------------------------------------------------------
>
>                 Key: GEODE-9191
>                 URL: https://issues.apache.org/jira/browse/GEODE-9191
>             Project: Geode
>          Issue Type: Sub-task
>            Reporter: Xiaojian Zhou
>            Assignee: Xiaojian Zhou
>            Priority: Major
>              Labels: GeodeOperationAPI, pull-request-available
>
> This scenario was found when introducing a GII test case for PR clear. The 
> sequence is:
> (1) There are 3 servers: server1 is an accessor, server2 and server3 are 
> datastores.
> (2) Shut down server2.
> (3) Send PR clear from server1 (the accessor) and restart server2 at the same 
> time. There is a race in which server2 does not receive the 
> PartitionedRegionClearMessage.
> (4) server2 finishes GII.
> (5) Only server3 received the PartitionedRegionClearMessage, and it hosts all 
> the primary buckets. When the PR clear thread iterates through these primary 
> buckets one by one, some of them might lose primary to server2. 
> (6) BR.cmnClearRegion will return immediately since the bucket is no longer 
> primary, but clearedBuckets.add(localPrimaryBucketRegion.getId()); will still 
> be called. So from the caller's point of view, this bucket is cleared. It will 
> not even throw PartitionedRegionPartialClearException.
> The fix is: before calling cmnClearRegion, we should call BR.doLockForPrimary 
> to make sure the bucket is still primary. If not, throw an exception. Then 
> clearedBuckets.add(localPrimaryBucketRegion.getId()); will not be called for 
> this bucket. 
> The expected behavior in this scenario is to throw 
> PartitionedRegionPartialClearException.


