Please file a bug for this. Even if that ordering doesn't work, nothing
should hang because of it.

Thanks,
Kirk

On Wed, Jul 28, 2021 at 5:27 PM Barrett Oglesby <bogle...@vmware.com> wrote:

> I reproduced your issue with your scripts.
>
> They do:
>
> create gateway-receiver
> create disk-store
> create gateway-sender
> create region
>
> With that order, I see the hang you mentioned. I'm not 100% sure why that
> is happening but you can prevent it by reordering these elements.
>
> As Anil said, you should start your GatewayReceiver last like:
>
> create disk-store
> create gateway-sender
> create region
> create gateway-receiver
>
> With that order, cluster1 restarts fine.
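
As a rough illustration of the reordering above, the following sketch sorts a list of gfsh command lines so that the disk store comes first and the gateway receiver comes last. The class name `GfshOrder` and its priority table are hypothetical helpers invented for this example; only the four command names come from the scripts discussed in this thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GfshOrder {
    // Hypothetical priority table: commands earlier in the array are run earlier.
    private static final String[] ORDER = {
        "create disk-store",
        "create gateway-sender",
        "create region",
        "create gateway-receiver"
    };

    // Returns the position of the command in ORDER, or ORDER.length if it is unknown
    // (unknown commands therefore end up after the receiver).
    static int priority(String command) {
        String trimmed = command.trim();
        for (int i = 0; i < ORDER.length; i++) {
            if (trimmed.startsWith(ORDER[i])) {
                return i;
            }
        }
        return ORDER.length;
    }

    // List.sort is stable, so commands of the same kind keep their relative order.
    static List<String> reorder(List<String> commands) {
        List<String> sorted = new ArrayList<>(commands);
        sorted.sort(Comparator.comparingInt(GfshOrder::priority));
        return sorted;
    }

    public static void main(String[] args) {
        // The order used by the original scripts, with the receiver first.
        List<String> original = List.of(
            "create gateway-receiver",
            "create disk-store",
            "create gateway-sender",
            "create region");
        reorder(original).forEach(System.out::println);
    }
}
```

The sketch only rearranges text; it does not talk to a locator or validate command options.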
>
> btw 1 - with the order you had, regardless of the hang, you'll see lots of
> dropped WAN events since the region doesn't exist yet when the receiver is
> started:
>
> [info 2021/07/28 17:02:39.795 PDT server1_1 <Function Execution
> Processor2> tid=0x3c] The GatewayReceiver started on port : 5411
>
> [warn 2021/07/28 17:02:39.883 PDT server1_1 <ServerConnection on port 5411
> Thread 1> tid=0x4a] Server connection from
> [identity(192.168.1.7(server2_2:25891)<v2>:41005,connection=1; port=52554]:
> Caught exception processing batch create request 0 for 100 events
> org.apache.geode.cache.RegionDestroyedException: Region /testregion was
> not found during batch create request 0
>
> btw 2 - I use CacheCreation.create to see the order that elements should
> be started. That's the object that the old GemFire cache xml uses to start
> things in the right order.
>
> Barry
> ________________________________
> From: Anilkumar Gingade <aging...@vmware.com>
> Sent: Wednesday, July 28, 2021 3:45 PM
> To: dev@geode.apache.org <dev@geode.apache.org>
> Subject: Re: "create region" cmd stuck on wan setup
>
> The recommendation with WAN setup is:
> - Create/start WAN Senders first
> - Create Regions
> - Create/Start WAN receivers last
>
> That way, when the WAN receiver is started, the regions already exist on
> all the sites. Sorry, I have not looked at your scripts...
>
> -Anil.
>
>
>
> On 7/28/21, 3:31 AM, "Alberto Bustamante Reyes"
> <alberto.bustamante.re...@est.tech> wrote:
>
>     Hi Geode devs,
>
>     I have been analyzing an issue that occurs in the following scenario:
>
>     1) I start two Geode clusters (cluster1 & cluster2) with one locator
> and two servers each.
>     Both clusters host a partitioned region called "testregion", which is
> replicated using a parallel gateway sender and a gateway receiver.
>     These are the gfsh files I have been using for creating the clusters:
> https://gist.github.com/alb3rtobr/e230623255632937fa68265f31e97f3a
>
>     2) I run a client connected to cluster2 performing operations on
> testregion.
>
>     3) cluster1 is stopped and all of its persistent data is deleted. Then I
> create cluster1 again.
>
>     4) At this point, the command to create "testregion" gets stuck.
>
>
>     After checking the thread stack and the code, I found that the problem
> is the following.
>
>     This thread is trapped in an infinite loop waiting for a bucket
> primary election at "PartitionedRegion.waitForNoStorageOrPrimary":
>
>
>     "Function Execution Processor4" tid=0x55
>         java.lang.Thread.State: TIMED_WAITING
>             at java.base@11.0.11/java.lang.Object.wait(Native Method)
>             -  waiting on org.apache.geode.internal.cache.BucketAdvisor@28be7ae0
>             at app//org.apache.geode.internal.cache.BucketAdvisor.waitForPrimaryMember(BucketAdvisor.java:1433)
>             at app//org.apache.geode.internal.cache.BucketAdvisor.waitForNewPrimary(BucketAdvisor.java:825)
>             at app//org.apache.geode.internal.cache.BucketAdvisor.getPrimary(BucketAdvisor.java:794)
>             at app//org.apache.geode.internal.cache.partitioned.RegionAdvisor.getPrimaryMemberForBucket(RegionAdvisor.java:1032)
>             at app//org.apache.geode.internal.cache.PartitionedRegion.getBucketPrimary(PartitionedRegion.java:9081)
>             at app//org.apache.geode.internal.cache.PartitionedRegion.waitForNoStorageOrPrimary(PartitionedRegion.java:3249)
>             at app//org.apache.geode.internal.cache.PartitionedRegion.getNodeForBucketWrite(PartitionedRegion.java:3234)
>             at app//org.apache.geode.internal.cache.PartitionedRegion.shadowPRWaitForBucketRecovery(PartitionedRegion.java:10110)
>             at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:564)
>             at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:443)
>             at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderEventProcessor.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderEventProcessor.java:195)
>             at app//org.apache.geode.internal.cache.wan.parallel.ConcurrentParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ConcurrentParallelGatewaySenderQueue.java:183)
>             at app//org.apache.geode.internal.cache.PartitionedRegion.postCreateRegion(PartitionedRegion.java:1177)
>             at app//org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3050)
>             at app//org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2910)
>             at app//org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2894)
>             at app//org.apache.geode.cache.RegionFactory.create(RegionFactory.java:773)
>
>
>     After testregion is created, the sender queue's partitioned region is
> created. While that region's buckets are being recovered, the command is
> trapped in an infinite loop waiting for a primary bucket election at
> PartitionedRegion.waitForNoStorageOrPrimary.
>
>     This seems to be a known issue because in
> PartitionedRegion.getNodeForBucketWrite, there is the following comment
> before calling waitForNoStorageOrPrimary (and the comment has been there
> since Geode's first commit!):
>
>         // Possible race with loss of redundancy at this point.
>         // This loop can possibly create a soft hang if no primary is ever selected.
>         // This is preferable to returning null since it will prevent obtaining the
>         // bucket lock for bucket creation.
>         return waitForNoStorageOrPrimary(bucketId, "write");
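
To illustrate the "soft hang" that comment describes, here is a minimal sketch of a wait loop with a deadline, so that the caller gets a failure instead of blocking forever when no primary is ever selected. This is not Geode's code: `PrimaryWaiter`, `setPrimary`, and `waitForPrimary` are hypothetical names, and whether a timeout is the right fix for `waitForNoStorageOrPrimary` is an open question.

```java
import java.util.concurrent.TimeoutException;

public class PrimaryWaiter {
    // Null until a primary is elected; guarded by the monitor of "this".
    private String primary;

    // Called by whoever elects the primary (hypothetical API).
    public synchronized void setPrimary(String member) {
        primary = member;
        notifyAll();
    }

    // Waits for a primary, but gives up once the deadline has passed.
    public synchronized String waitForPrimary(long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (primary == null) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException(
                    "no primary elected within " + timeoutMillis + " ms");
            }
            // wait(0) would block indefinitely, so wait at least 1 ms.
            wait(Math.max(1L, remainingNanos / 1_000_000L));
        }
        return primary;
    }

    public static void main(String[] args) throws Exception {
        PrimaryWaiter waiter = new PrimaryWaiter();
        try {
            waiter.waitForPrimary(100); // nobody elects a primary
        } catch (TimeoutException e) {
            System.out.println("timed out: " + e.getMessage());
        }
        waiter.setPrimary("server1_1");
        System.out.println(waiter.waitForPrimary(100));
    }
}
```

The loop re-checks the condition after every wakeup, so spurious wakeups and early notifications are both handled.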
>
>     Any idea about why the primary bucket is not elected?
>
>     It seems the failure is related to the fact that "testregion" is
> receiving updates from the receiver before the "create region" command has
> finished. If the test is repeated without traffic on cluster2, or if I
> create cluster1's receiver after creating "testregion", the problem does
> not occur.
>
>     Is there any recommendation on the startup order of regions, senders
> and receivers for a scenario like the one described?
>
>     Thanks in advance,
>
>     Alberto B.
>
>
