[ https://issues.apache.org/jira/browse/GEODE-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428359#comment-17428359 ]
Alexander Murmann commented on GEODE-9633:
------------------------------------------

Hi [~alberto.bustamante.reyes]! Thanks for digging this up! What version did you encounter this on? Did this happen on an already released version or on develop?

> Region and gateway receiver init order may cause a hang
> -------------------------------------------------------
>
>                 Key: GEODE-9633
>                 URL: https://issues.apache.org/jira/browse/GEODE-9633
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Alberto Bustamante Reyes
>            Priority: Major
>
> This ticket has been created as suggested on [the dev list|https://markmail.org/thread/qq32z5hducjoqndz].
> -----
> I have been analyzing an issue that occurs in the following scenario:
> 1) I start two Geode clusters (cluster1 & cluster2) with one locator and two servers each.
> Both clusters host a partitioned region called "testregion", which is replicated using a parallel gateway sender and a gateway receiver.
> (These are [the gfsh files|https://gist.github.com/alb3rtobr/e230623255632937fa68265f31e97f3a] I have been using to create the clusters.)
> 2) I run a client connected to cluster2 performing operations on testregion.
> 3) cluster1 is stopped and all persistent data is deleted. Then I create cluster1 again.
> 4) At this point, the command to create "testregion" gets stuck.
> After checking the thread stack and the code, I found that the problem is the following.
> This thread is trapped in an infinite loop waiting for a bucket primary election at "PartitionedRegion.waitForNoStorageOrPrimary":
> {code}
> "Function Execution Processor4" tid=0x55
>     java.lang.Thread.State: TIMED_WAITING
>     at java.base@11.0.11/java.lang.Object.wait(Native Method)
>     - waiting on org.apache.geode.internal.cache.BucketAdvisor@28be7ae0
>     at app//org.apache.geode.internal.cache.BucketAdvisor.waitForPrimaryMember(BucketAdvisor.java:1433)
>     at app//org.apache.geode.internal.cache.BucketAdvisor.waitForNewPrimary(BucketAdvisor.java:825)
>     at app//org.apache.geode.internal.cache.BucketAdvisor.getPrimary(BucketAdvisor.java:794)
>     at app//org.apache.geode.internal.cache.partitioned.RegionAdvisor.getPrimaryMemberForBucket(RegionAdvisor.java:1032)
>     at app//org.apache.geode.internal.cache.PartitionedRegion.getBucketPrimary(PartitionedRegion.java:9081)
>     at app//org.apache.geode.internal.cache.PartitionedRegion.waitForNoStorageOrPrimary(PartitionedRegion.java:3249)
>     at app//org.apache.geode.internal.cache.PartitionedRegion.getNodeForBucketWrite(PartitionedRegion.java:3234)
>     at app//org.apache.geode.internal.cache.PartitionedRegion.shadowPRWaitForBucketRecovery(PartitionedRegion.java:10110)
>     at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:564)
>     at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:443)
>     at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderEventProcessor.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderEventProcessor.java:195)
>     at app//org.apache.geode.internal.cache.wan.parallel.ConcurrentParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ConcurrentParallelGatewaySenderQueue.java:183)
>     at app//org.apache.geode.internal.cache.PartitionedRegion.postCreateRegion(PartitionedRegion.java:1177)
>     at app//org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3050)
>     at app//org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2910)
>     at app//org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2894)
>     at app//org.apache.geode.cache.RegionFactory.create(RegionFactory.java:773)
> {code}
> After testregion is created, the sender queue partitioned region is created. While that region's buckets are being recovered, the command is trapped in an infinite loop waiting for a primary bucket election at PartitionedRegion.waitForNoStorageOrPrimary.
> This seems to be a known issue, because in PartitionedRegion.getNodeForBucketWrite there is the following comment before the call to waitForNoStorageOrPrimary (and the comment has been there since Geode's first commit!):
> {code}
> // Possible race with loss of redundancy at this point.
> // This loop can possibly create a soft hang if no primary is ever selected.
> // This is preferable to returning null since it will prevent obtaining the
> // bucket lock for bucket creation.
> return waitForNoStorageOrPrimary(bucketId, "write");
> {code}
> Any idea why the primary bucket is not elected?
> It seems the failure is related to the fact that "testregion" is receiving updates from the receiver before the "create region" command has finished. If the test is repeated without traffic on cluster2, or if I create cluster1's receiver after creating "testregion", the problem does not happen.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
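
To make the "soft hang" from the quoted code comment concrete, here is a minimal, self-contained Java sketch of the general pattern: a caller polls for a primary, and if none is ever elected, an unbounded loop never returns, whereas a deadline lets the caller give up. This is not Geode code and not a proposed fix for GEODE-9633. The class PrimaryWaitSketch, the method waitForPrimary, and the Supplier-based lookup are all hypothetical names invented for this illustration.

```java
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in Geode.
public class PrimaryWaitSketch {

  /**
   * Polls primaryLookup until it returns a non-null member id or the
   * timeout elapses. Returns Optional.empty() on timeout or interrupt.
   */
  public static Optional<String> waitForPrimary(Supplier<String> primaryLookup,
      long timeoutMillis, long pollMillis) {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (true) {
      String primary = primaryLookup.get();
      if (primary != null) {
        return Optional.of(primary);
      }
      // Without this deadline check the loop would retry forever when no
      // primary is ever elected, which is the "soft hang" case.
      if (System.nanoTime() - deadline >= 0) {
        return Optional.empty();
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        // Restore the interrupt flag and stop waiting.
        Thread.currentThread().interrupt();
        return Optional.empty();
      }
    }
  }

  public static void main(String[] args) {
    // Lookup that never yields a primary: the wait ends at the deadline.
    Optional<String> never = waitForPrimary(() -> null, 200, 20);
    System.out.println("primary found: " + never.isPresent());

    // Lookup that yields a primary immediately.
    Optional<String> elected = waitForPrimary(() -> "server1", 200, 20);
    System.out.println("primary found: " + elected.isPresent());
  }
}
```

Whether a bounded wait is appropriate in getNodeForBucketWrite is a separate question; the quoted comment notes that looping is preferred there over returning null.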