[jira] [Commented] (GEODE-9633) Region and gateway receiver init order may cause a hang

Alexander Murmann (Jira) Wed, 13 Oct 2021 10:22:04 -0700


    [ https://issues.apache.org/jira/browse/GEODE-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428359#comment-17428359 ]


Alexander Murmann commented on GEODE-9633:
------------------------------------------

Hi [~alberto.bustamante.reyes]! Thanks for digging this up! What version did 
you encounter this on? Did this happen on an already released version or on 
develop?

> Region and gateway receiver init order may cause a hang
> -------------------------------------------------------
>
>                 Key: GEODE-9633
>                 URL: https://issues.apache.org/jira/browse/GEODE-9633
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Alberto Bustamante Reyes
>            Priority: Major
>
> This ticket has been created as suggested on [the dev 
> list|https://markmail.org/thread/qq32z5hducjoqndz].
> -----
> I have been analyzing an issue that occurs in the following scenario:
> 1) I start two Geode clusters (cluster1 & cluster2) with one locator and two
> servers each.
> Both clusters host a partitioned region called "testregion", which is
> replicated using a parallel gateway sender and a gateway receiver.
> (These are [the gfsh 
> files|https://gist.github.com/alb3rtobr/e230623255632937fa68265f31e97f3a] I 
> have been using for creating the clusters.)
> 2) I run a client connected to cluster2 performing operations on testregion.
> 3) cluster1 is stopped and all of its persistent data is deleted. Then I create
> cluster1 again.
> 4) At this point, the command to create "testregion" gets stuck.
> After checking the thread stacks and the code, I found the following problem.
> This thread is trapped in an infinite loop waiting for a bucket primary
> election at "PartitionedRegion.waitForNoStorageOrPrimary":
> {code}
> "Function Execution Processor4" tid=0x55
>     java.lang.Thread.State: TIMED_WAITING
>         at java.base@11.0.11/java.lang.Object.wait(Native Method)
>         -  waiting on org.apache.geode.internal.cache.BucketAdvisor@28be7ae0
>         at app//org.apache.geode.internal.cache.BucketAdvisor.waitForPrimaryMember(BucketAdvisor.java:1433)
>         at app//org.apache.geode.internal.cache.BucketAdvisor.waitForNewPrimary(BucketAdvisor.java:825)
>         at app//org.apache.geode.internal.cache.BucketAdvisor.getPrimary(BucketAdvisor.java:794)
>         at app//org.apache.geode.internal.cache.partitioned.RegionAdvisor.getPrimaryMemberForBucket(RegionAdvisor.java:1032)
>         at app//org.apache.geode.internal.cache.PartitionedRegion.getBucketPrimary(PartitionedRegion.java:9081)
>         at app//org.apache.geode.internal.cache.PartitionedRegion.waitForNoStorageOrPrimary(PartitionedRegion.java:3249)
>         at app//org.apache.geode.internal.cache.PartitionedRegion.getNodeForBucketWrite(PartitionedRegion.java:3234)
>         at app//org.apache.geode.internal.cache.PartitionedRegion.shadowPRWaitForBucketRecovery(PartitionedRegion.java:10110)
>         at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:564)
>         at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:443)
>         at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderEventProcessor.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderEventProcessor.java:195)
>         at app//org.apache.geode.internal.cache.wan.parallel.ConcurrentParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ConcurrentParallelGatewaySenderQueue.java:183)
>         at app//org.apache.geode.internal.cache.PartitionedRegion.postCreateRegion(PartitionedRegion.java:1177)
>         at app//org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3050)
>         at app//org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2910)
>         at app//org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2894)
>         at app//org.apache.geode.cache.RegionFactory.create(RegionFactory.java:773)
> {code}
> After testregion is created, the sender queue's partitioned region is created.
> While that region's buckets are being recovered, the command is trapped in an
> infinite loop waiting for a primary bucket election at
> PartitionedRegion.waitForNoStorageOrPrimary.
> This seems to be a known issue, because in
> PartitionedRegion.getNodeForBucketWrite there is the following comment before
> the call to waitForNoStorageOrPrimary (and the comment has been there since
> Geode's first commit!):
> {code}
>     // Possible race with loss of redundancy at this point.
>     // This loop can possibly create a soft hang if no primary is ever selected.
>     // This is preferable to returning null since it will prevent obtaining the
>     // bucket lock for bucket creation.
>     return waitForNoStorageOrPrimary(bucketId, "write");
> {code}
> Any idea why the primary bucket is not elected?
> The failure seems to be related to the fact that "testregion" is receiving
> updates from the receiver before the "create region" command has finished. If
> the test is repeated without traffic on cluster2, or if I create cluster1's
> receiver after creating "testregion", the problem does not occur.
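
To illustrate the "soft hang" that the quoted code comment warns about, here is a minimal, self-contained Java sketch of the general pattern: a caller that retries a timed wait until a primary appears never returns if no primary is ever elected, while a bounded variant gives up and reports that no primary was found. This is not Geode's implementation. The class PrimaryWaitSketch, the BucketState stand-in, and the methods waitUnbounded and waitBounded are all hypothetical names invented for this illustration, as are the timeout values.

```java
import java.util.Optional;
import java.util.concurrent.TimeUnit;

public class PrimaryWaitSketch {

  /** Hypothetical stand-in for a bucket's primary-election state (not Geode's BucketAdvisor). */
  public static final class BucketState {
    private final Object lock = new Object();
    private String primary; // null until a primary is elected; guarded by lock

    public void electPrimary(String member) {
      synchronized (lock) {
        primary = member;
        lock.notifyAll(); // wake up any thread waiting for a primary
      }
    }

    /** Waits up to timeoutMillis; returns empty if no primary was elected or the thread was interrupted. */
    public Optional<String> waitForPrimary(long timeoutMillis) {
      long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
      synchronized (lock) {
        while (primary == null) {
          long remainingNanos = deadline - System.nanoTime();
          if (remainingNanos <= 0) {
            return Optional.empty();
          }
          try {
            // Object.wait(0) waits forever, so never pass less than 1 ms.
            lock.wait(Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return Optional.empty();
          }
        }
        return Optional.of(primary);
      }
    }
  }

  /** Unbounded retry: never returns if no primary is ever elected (the "soft hang"). */
  public static String waitUnbounded(BucketState bucket) {
    while (true) {
      Optional<String> primary = bucket.waitForPrimary(100);
      if (primary.isPresent()) {
        return primary.get();
      }
    }
  }

  /** Bounded retry: gives up after maxAttempts timed waits and returns empty. */
  public static Optional<String> waitBounded(BucketState bucket, int maxAttempts,
      long perAttemptMillis) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      Optional<String> primary = bucket.waitForPrimary(perAttemptMillis);
      if (primary.isPresent()) {
        return primary;
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    BucketState bucket = new BucketState();
    // No primary yet: the bounded variant returns instead of hanging.
    System.out.println("before election: " + waitBounded(bucket, 3, 20));
    bucket.electPrimary("server-1");
    // A primary exists now, so even the unbounded variant returns.
    System.out.println("after election: " + waitUnbounded(bucket));
  }
}
```

Calling waitUnbounded before any election would reproduce the hang in miniature, which is why main only calls it after electPrimary. Whether a bounded wait would be acceptable in Geode itself is an open question: the quoted comment says the soft hang is preferred over returning null.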



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
