Please file a bug for this. Even if that ordering doesn't work, nothing should hang because of it.
Thanks,
Kirk

On Wed, Jul 28, 2021 at 5:27 PM Barrett Oglesby <bogle...@vmware.com> wrote:

> I reproduced your issue with your scripts.
>
> They do:
>
> create gateway-receiver
> create disk-store
> create gateway-sender
> create region
>
> With that order, I see the hang you mentioned. I'm not 100% sure why that
> is happening, but you can prevent it by reordering these elements.
>
> As Anil said, you should start your GatewayReceiver last, like:
>
> create disk-store
> create gateway-sender
> create region
> create gateway-receiver
>
> With that order, cluster1 restarts fine.
>
> btw 1 - with the order you had, regardless of the hang, you'll see lots of
> dropped WAN events since the region doesn't exist yet when the receiver is
> started:
>
> [info 2021/07/28 17:02:39.795 PDT server1_1 <Function Execution
> Processor2> tid=0x3c] The GatewayReceiver started on port : 5411
>
> [warn 2021/07/28 17:02:39.883 PDT server1_1 <ServerConnection on port 5411
> Thread 1> tid=0x4a] Server connection from
> [identity(192.168.1.7(server2_2:25891)<v2>:41005,connection=1; port=52554]:
> Caught exception processing batch create request 0 for 100 events
> org.apache.geode.cache.RegionDestroyedException: Region /testregion was
> not found during batch create request 0
>
> btw 2 - I use CacheCreation.create to see the order that elements should
> be started. That's the object that the old GemFire cache xml uses to start
> things in the right order.
>
> Barry
> ________________________________
> From: Anilkumar Gingade <aging...@vmware.com>
> Sent: Wednesday, July 28, 2021 3:45 PM
> To: dev@geode.apache.org <dev@geode.apache.org>
> Subject: Re: "create region" cmd stuck on wan setup
>
> The recommendation with WAN setup is:
> - Create/start WAN Senders first
> - Create Regions
> - Create/Start WAN receivers last
>
> That way, when the WAN receiver is started, the regions are already created
> on all the sites. Sorry, I have not looked at your scripts...
>
> -Anil.
>
>
> On 7/28/21, 3:31 AM, "Alberto Bustamante Reyes"
> <alberto.bustamante.re...@est.tech> wrote:
>
> Hi Geode devs,
>
> I have been analyzing an issue that occurs in the following scenario:
>
> 1) I start two Geode clusters (cluster1 & cluster2) with one locator
> and two servers each.
> Both clusters host a partitioned region called "testregion", which is
> replicated using a parallel gateway sender and a gateway receiver.
> These are the gfsh files I have been using for creating the clusters:
> https://gist.github.com/alb3rtobr/e230623255632937fa68265f31e97f3a
>
> 2) I run a client connected to cluster2 performing operations on
> testregion.
>
> 3) cluster1 is stopped and all persistent data is deleted. Then I
> create cluster1 again.
>
> 4) At this point, the command to create "testregion" gets stuck.
>
>
> After checking the thread stack and the code, I found that the problem
> is the following.
>
> This thread is trapped in an infinite loop waiting for a bucket
> primary election at "PartitionedRegion.waitForNoStorageOrPrimary":
>
>
> "Function Execution Processor4" tid=0x55
>     java.lang.Thread.State: TIMED_WAITING
>     at java.base@11.0.11/java.lang.Object.wait(Native Method)
>     - waiting on org.apache.geode.internal.cache.BucketAdvisor@28be7ae0
>     at app//org.apache.geode.internal.cache.BucketAdvisor.waitForPrimaryMember(BucketAdvisor.java:1433)
>     at app//org.apache.geode.internal.cache.BucketAdvisor.waitForNewPrimary(BucketAdvisor.java:825)
>     at app//org.apache.geode.internal.cache.BucketAdvisor.getPrimary(BucketAdvisor.java:794)
>     at app//org.apache.geode.internal.cache.partitioned.RegionAdvisor.getPrimaryMemberForBucket(RegionAdvisor.java:1032)
>     at app//org.apache.geode.internal.cache.PartitionedRegion.getBucketPrimary(PartitionedRegion.java:9081)
>     at app//org.apache.geode.internal.cache.PartitionedRegion.waitForNoStorageOrPrimary(PartitionedRegion.java:3249)
>     at app//org.apache.geode.internal.cache.PartitionedRegion.getNodeForBucketWrite(PartitionedRegion.java:3234)
>     at app//org.apache.geode.internal.cache.PartitionedRegion.shadowPRWaitForBucketRecovery(PartitionedRegion.java:10110)
>     at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:564)
>     at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:443)
>     at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderEventProcessor.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderEventProcessor.java:195)
>     at app//org.apache.geode.internal.cache.wan.parallel.ConcurrentParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ConcurrentParallelGatewaySenderQueue.java:183)
>     at app//org.apache.geode.internal.cache.PartitionedRegion.postCreateRegion(PartitionedRegion.java:1177)
>     at app//org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3050)
>     at app//org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2910)
>     at app//org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2894)
>     at app//org.apache.geode.cache.RegionFactory.create(RegionFactory.java:773)
>
>
> After testregion is created, the sender queue partitioned region is
> created. While that region's buckets are being recovered, the command is
> trapped in an infinite loop waiting for a primary bucket election at
> PartitionedRegion.waitForNoStorageOrPrimary.
>
> This seems to be a known issue because in
> PartitionedRegion.getNodeForBucketWrite, there is the following comment
> before the call to waitForNoStorageOrPrimary (and the comment has been there
> since Geode's first commit!):
>
> // Possible race with loss of redundancy at this point.
> // This loop can possibly create a soft hang if no primary is ever selected.
> // This is preferable to returning null since it will prevent obtaining the
> // bucket lock for bucket creation.
> return waitForNoStorageOrPrimary(bucketId, "write");
>
> Any idea why the primary bucket is not elected?
>
> It seems the failure is related to the fact that "testregion" is
> receiving updates from the receiver before the "create region" command has
> finished. If the test is repeated without traffic on cluster2, or if I
> create cluster1's receiver after creating "testregion", this problem
> does not happen.
>
> Is there any recommendation on the startup order of regions, senders
> and receivers for a scenario like the one described?
>
> Thanks in advance,
>
> Alberto B.
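
For anyone scripting cluster creation, the ordering discussed in this thread (disk store, then gateway sender, then region, then gateway receiver last) can be sketched as a small helper that sorts gfsh `create` commands before they are written to a script. This is only an illustrative sketch: the names `RECOMMENDED_ORDER`, `rank` and `reorder` are hypothetical, the commands are matched by prefix only, and nothing here is part of Geode or gfsh.

```python
# Creation order reported to work in this thread: the receiver goes last.
RECOMMENDED_ORDER = [
    "create disk-store",
    "create gateway-sender",
    "create region",
    "create gateway-receiver",
]


def rank(command: str) -> int:
    """Return the position of a gfsh command in the recommended order.

    Commands that match none of the known prefixes get rank -1 so they
    stay ahead of the WAN-related commands (an assumption of this sketch).
    """
    stripped = command.strip()
    for position, prefix in enumerate(RECOMMENDED_ORDER):
        if stripped.startswith(prefix):
            return position
    return -1


def reorder(commands):
    """Return the commands sorted into the recommended order.

    sorted() is stable, so commands with the same rank keep their
    original relative order.
    """
    return sorted(commands, key=rank)


# The order used by the scripts that reproduced the hang.
original = [
    "create gateway-receiver",
    "create disk-store",
    "create gateway-sender",
    "create region",
]

reordered = reorder(original)
print("\n".join(reordered))
```

Feeding it the order from the scripts that showed the hang yields the sequence Barry reported as restarting cleanly. The helper only rearranges text; it does not address the underlying hang, which still deserves the bug report requested above.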