[jira] [Resolved] (GEODE-10405) Offline server won't start if gateway senders were restarted with clean queue option

2022-09-06 Thread Mario Ivanac (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Ivanac resolved GEODE-10405.
--
Fix Version/s: 1.16.0
   Resolution: Fixed

> Offline server won't start if gateway senders were restarted with clean queue 
> option
> 
>
> Key: GEODE-10405
> URL: https://issues.apache.org/jira/browse/GEODE-10405
> Project: Geode
>  Issue Type: Improvement
>  Components: persistence, regions, wan
>Reporter: Mario Ivanac
>Assignee: Mario Ivanac
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> In case gateway senders were restarted with "clean queue" option, while 
> server was offline, if we want to start offline server, it will fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (GEODE-10405) Offline server won't start if gateway senders were restarted with clean queue option

2022-09-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600696#comment-17600696
 ] 

ASF subversion and git services commented on GEODE-10405:
-

Commit 6fcb258a6515f268b119eacd88207860a8750720 in geode's branch 
refs/heads/develop from Mario Ivanac
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=6fcb258a65 ]

GEODE-10405: added ignore exceprion for GW queue region (#7831)

* GEODE-10405: added ignore exceprion for GW queue region

* GEODE-10405: added test

> Offline server won't start if gateway senders were restarted with clean queue 
> option
> 
>
> Key: GEODE-10405
> URL: https://issues.apache.org/jira/browse/GEODE-10405
> Project: Geode
>  Issue Type: Improvement
>  Components: persistence, regions, wan
>Reporter: Mario Ivanac
>Assignee: Mario Ivanac
>Priority: Major
>  Labels: pull-request-available
>
> In case gateway senders were restarted with "clean queue" option, while 
> server was offline, if we want to start offline server, it will fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (GEODE-10215) WAN replication not working after re-creating the partitioned region

2022-09-06 Thread Jakov Varenina (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakov Varenina resolved GEODE-10215.

Fix Version/s: 1.15.0
   Resolution: Fixed

> WAN replication not working after re-creating the partitioned region
> 
>
> Key: GEODE-10215
> URL: https://issues.apache.org/jira/browse/GEODE-10215
> Project: Geode
>  Issue Type: Bug
>Reporter: Jakov Varenina
>Assignee: Jakov Varenina
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
>
> Steps to reproduce the issue:
> Start multi-site with at least 3 servers on each site. If there are less than 
> three servers then issue will not reproduce.
> Configuration site 1:
> {code:java}
> create disk-store --name=queue_disk_store --dir=ds2
> create gateway-sender -id="remote_site_2" --parallel="true" 
> --remote-distributed-system-id="1"  -enable-persistence=true 
> --disk-store-name=queue_disk_store
> create disk-store --name=data_disk_store --dir=ds1
> create region --name=example-region --type=PARTITION_PERSISTENT 
> --gateway-sender-id="remote_site_2" --disk-store=data_disk_store 
> --total-num-buckets=1103 --redundant-copies=1 --enable-synchronous-disk=false
> #Configure the remote site 2 with the region and the gateway-receiver  
> #Run some traffic so that all buckets are created and data is replicated to 
> the other site
> alter region --name=/example-region --gateway-sender-id=""
> destroy region --name=/example-region
> create region --name=example-region --type=PARTITION_PERSISTENT 
> --gateway-sender-id="remote_site_2" --disk-store=data_disk_store 
> --total-num-buckets=1103 --redundant-copies=1 --enable-synchronous-disk=false
> #run traffic to see that some data is not replicated to the remote site 2 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (GEODE-10339) The server fails to start because the .crf or the .drf file is missing

2022-09-06 Thread Jakov Varenina (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakov Varenina updated GEODE-10339:
---
Description: 
{color:#0e101a}The server fails with following:{color}
{code:java}
{"timestamp":"2022-05-16T08:25:35.708Z","severity":"error","message":"Cache 
initialization for GemFireCache[id = 776315735; isClosing = false; 
isShutDownAll = false; created = Mon May 16 08:25:33 UTC 2022; server = false; 
copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because: 
java.lang.IllegalStateException: The following required files could not be 
found: *.crf files with these ids: 
[33].","metadata":{"function":"KVDB"},"version":"1.1.0","service_id":"eric-udr-kvdb-ag","extra_data":{"thread_info":{"thread_name":"main","thread_id":"1"},"e":{"exception":""}}}
{code}
 

{color:#0e101a}As a last compaction step, the server deletes the compacted .crf 
file. The deletion is done in the following way:{color}
 # {color:#0e101a}Write delete operation (delete ".crf" file) in the ".if" 
file. {color}
 # {color:#0e101a}Delete .crf file{color}

{color:#0e101a}The problem with server startup happens in the following 
scenario:{color}
 # {color:#0e101a}The server writes the delete operation (for .crf file) in the 
".if" file. The write is not immediately flushed to the ".if" file, but it goes 
to the async write buffer.{color}
 # {color:#0e101a}The server deletes the ".crf" file.{color}
 # {color:#0e101a}The forceful restart (of the machine where the server 
resides) happens before the async write buffer is flushed to the ".if" file. 
This scenario leaves the ".if" file not updated, and therefore server startup 
fails later on from the persisted region data.{color}

 

{color:#0e101a}To avoid the above issue, we can use the existing parameter in a 
geode that forces the server to write synchronously to the ".if" file:{color}
{code:java}
--J=-Dgemfire.syncMetaDataWrites=true
{code}
{color:#0e101a}This parameter is not mentioned anywhere in the documentation. 
So it would be good to add it to the following document:{color}

{color:#0e101a}[https://geode.apache.org/docs/guide/114/managing/disk_storage/managing_disk_buffer_flushes.html]{color}

  was:
{color:#0e101a}The server fails with following:{color}
{code:java}
{"timestamp":"2022-05-16T08:25:35.708Z","severity":"error","message":"Cache 
initialization for GemFireCache[id = 776315735; isClosing = false; 
isShutDownAll = false; created = Mon May 16 08:25:33 UTC 2022; server = false; 
copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because: 
java.lang.IllegalStateException: The following required files could not be 
found: *.crf files with these ids: 
[33].","metadata":{"function":"KVDB"},"version":"1.1.0","service_id":"eric-udr-kvdb-ag","extra_data":{"thread_info":{"thread_name":"main","thread_id":"1"},"e":{"exception":""}}}
{code}
 

{color:#0e101a}As a last compaction step, the server deletes the compacted .crf 
file. The deletion is done in the following way:{color}
 # {color:#0e101a}Write delete operation (delete ".crf" file) in the ".if" 
file. {color}
 # {color:#0e101a}Delete .crf file{color}

{color:#0e101a}The problem with server startup happens in the following 
scenario:{color}
 # {color:#0e101a}The server writes the delete operation (for .crf file) in the 
".if" file. The write is not immediately flushed to the ".if" file, but it goes 
to the async write buffer.{color}
 # {color:#0e101a}The server deletes the ".crf" file.{color}
 # {color:#0e101a}The forceful restart (of the machine where the server 
resides) happens before the async write buffer is flushed to the ".if" file. 
This scenario leaves the ".if" file not updated, and therefore server startup 
fails later on from the persisted region data.{color}

 

{color:#0e101a}To avoid the above issue, we can use the existing parameter in a 
geode that forces the server to write synchronously to the ".if" file:{color}
{code:java}
--J=-Dgemfire.syncMetaDataWrites=true
{code}
{color:#0e101a}This parameter is not mentioned anywhere in the documentation. 
So it would be good to add it to the following document:{color}

{color:#0e101a}[https://geode.apache.org/docs/guide/114/managing/disk_storage/managing_disk_buffer_flushes.html]{color}

 

{color:#0e101a}Changing this parameter's default value to true would also be 
good. {color}{color:#0e101a}This parameter should not affect performance as the 
".if" file is not updated frequently.{color}


> The server fails to start because the .crf or the .drf file is missing
> --
>
> Key: GEODE-10339
> URL: https://issues.apache.org/jira/browse/GEODE-10339
> Project: Geode
>  Issue Type: Bug
>Reporter: Jakov Varenina
>Assignee: Jakov Varenina
>Priority: Major
>  Labels: pull-request-available

[jira] [Updated] (GEODE-10403) Distributed deadlock when stopping gateway sender

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez updated GEODE-10403:
--
Fix Version/s: 1.16.0

> Distributed deadlock when stopping gateway sender
> -
>
> Key: GEODE-10403
> URL: https://issues.apache.org/jira/browse/GEODE-10403
> Project: Geode
>  Issue Type: Bug
>  Components: wan
>Affects Versions: 1.12.9, 1.13.8, 1.14.4, 1.15.0
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: needsTriage, pull-request-available
> Fix For: 1.16.0
>
>
> A distributed deadlock has been found during some tests of a Geode system 
> with WAN replication when stopping the gateway sender while sending a fair 
> amount of operations to the servers.
> The distributed deadlock manifests in the gateway sender stop command hanging 
> forever and by all normal Geode operations from clients (gets, puts,...) not 
> being responded.
> The situation is provoked by the Gateway sender stop command that first takes 
> the lifecycle lock and then, at a given point, tries to retrieve the size of 
> the gateway sender. This operation, that requires communication with the 
> other peers never finishes, probably because the response from one of the 
> peers is never received.
> Another thread is blocked when trying to acquire the lifecycle lock in 
> AbstractGatewaySender.distribute().
> Finally many threads handling Geode operations (get, put...) get blocked in 
> the DistributedCacheOperation._distribute() call waiting for a response from 
> another peer.
> Thread dump section from blocked gateway sender stop command in call to get 
> size of queue:
> {{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 
> daemon prio=10 os_prio=0 cpu=45.55ms elapsed=4152.76s tid=0x7f92bc1c2000 
> nid=0x2154 waiting on condition  [0x7f9179cd2000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x00031ca2be50> (a 
> java.util.concurrent.CountDownLatch$Sync)}}
> {{        at 
> java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{        at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{        at 
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{        at 
> org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)}}
> {{        at 
> org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)}}
> {{        at 
> org.apache.geode.

[jira] [Updated] (GEODE-10340) Add new DiskStoreMXBean JMX metrics

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez updated GEODE-10340:
--
Fix Version/s: 1.16.0

> Add new DiskStoreMXBean JMX metrics
> ---
>
> Key: GEODE-10340
> URL: https://issues.apache.org/jira/browse/GEODE-10340
> Project: Geode
>  Issue Type: New Feature
>  Components: persistence, statistics
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> In order to be able to visualize the progress of oplog recovery at server 
> startup it would be nice that the recoveredEntryCreates, 
> recoveredEntryUpdates and recoveredEntryDestroys DiskStore stats are 
> published via JMX.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (GEODE-10300) C++ Native client messages coming from the locator cannot be longer than 3000 bytes

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez updated GEODE-10300:
--
Fix Version/s: 1.16.0

> C++ Native client messages coming from the locator cannot be longer than 3000 
> bytes
> ---
>
> Key: GEODE-10300
> URL: https://issues.apache.org/jira/browse/GEODE-10300
> Project: Geode
>  Issue Type: Bug
>  Components: native client
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> If a locator sends a response to the C++ native client that is longer than 
> 3000 bytes the C++ native client library will only read the first 3000 bytes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (GEODE-10323) OffHeapStorageJUnitTest testCreateOffHeapStorage fails with AssertionError: expected:<100> but was:<1048576>

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez updated GEODE-10323:
--
Fix Version/s: 1.16.0

> OffHeapStorageJUnitTest testCreateOffHeapStorage fails with AssertionError: 
> expected:<100> but was:<1048576>
> 
>
> Key: GEODE-10323
> URL: https://issues.apache.org/jira/browse/GEODE-10323
> Project: Geode
>  Issue Type: Bug
>  Components: offheap
>Affects Versions: 1.16.0
>Reporter: Kirk Lund
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> [Please see 1st comment on this ticket below the description for 
> recommendation on how to fix.]
> {{OffHeapStorageJUnitTest testCreateOffHeapStorage}} started failing 
> intermittently due to commit a350ed2 which went in as 
> "[GEODE-10087|https://issues.apache.org/jira/browse/GEODE-10087]: Enhance 
> off-heap fragmentation visibility. 
> ([#7407|https://github.com/apache/geode/pull/7407/commits/1640effc5c760afa8d9ec4c2743950bb1de0ef8f])".
> The failure stack looks like:
> {noformat}
> OffHeapStorageJUnitTest > testCreateOffHeapStorage FAILED
> java.lang.AssertionError: expected:<100> but was:<1048576>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.geode.internal.offheap.OffHeapStorageJUnitTest.testCreateOffHeapStorage(OffHeapStorageJUnitTest.java:220)
> {noformat}
> The cause is in {{MemoryAllocatorImpl}}. A new scheduled executor 
> {{updateNonRealTimeStatsExecutor}} was added near the bottom of the 
> constructor which immediately gets scheduled to invoke 
> {{freeList::updateNonRealTimeStats}} repeatedly at the specified frequency of 
> {{updateOffHeapStatsFrequencyMs}} whenever an instance of 
> {{MemoryAllocatorImpl}} is created.
> {{freeList::updateNonRealTimeStats}} then updates {{setLargestFragment}} and 
> {{setFreedChunks}}.
> Scheduling or starting any sort of threads from a constructor is considered a 
> bad practice and should only be done via some sort of activation method such 
> as {{start()}} which can be invoked from the static factory method for 
> {{MemoryAllocatorImpl}}. The unit tests would then continue to construct 
> instances via a constructor that does NOT invoke {{start()}}.
> {{OffHeapStorageJUnitTest testCreateOffHeapStorage}}, and possibly other 
> tests, is written with the assumption that no other thread or component can 
> or will be updating the {{OffHeapStats}}. Hence it is then safe for the test 
> to setLargestFragment and then assert that it has the value passed in:
> {noformat}
> 219:  stats.setLargestFragment(100);
> 220:  assertEquals(100, stats.getLargestFragment());
> {noformat}
> Line 220 is the source of the assertion failure because 
> {{updateNonRealTimeStatsExecutor}} is racing with the JUnit thread and 
> changes the value of {{setLargestFragment}} immediately after the test sets 
> the value to 100, but before the test invokes the assertion.
> Looking back at the [PR test 
> results|https://cio.hdb.gemfire-ci.info/commits/pr/7407/junit?days=90] we can 
> see that stress-new-test failed 5 times with this exact failure before being 
> merged to develop:
> * 
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-7407/test-results/repeatTest/1646140652/
> * 
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-7407/test-results/repeatTest/1646154134/
>  (x2)
> * 
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-7407/test-results/repeatTest/1646156997/
> * 
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-7407/test-results/repeatTest/1646170346/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (GEODE-10371) C++ Native client: Improve dispersion on connections expiration

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez updated GEODE-10371:
--
Fix Version/s: 1.16.0

> C++ Native client: Improve dispersion on connections expiration
> ---
>
> Key: GEODE-10371
> URL: https://issues.apache.org/jira/browse/GEODE-10371
> Project: Geode
>  Issue Type: Improvement
>  Components: native client
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> The dispersion on connections expirations in the C++ native client works in 
> such a way that it adds a dispersion (variance) between -9% and 9% over the 
> time for a connection to expire due to load-conditioning so that, in the 
> event of having many connections being created at the same, they do not 
> expire at the right exact time.
> Nevertheless, the current implementation has two problems:
> - The randomness of the variance depends on the current time in seconds. As a 
> result, for connections created in the same second, the variance will be the 
> same and, therefore, the expiration time too.
> - The randomness is created using the C standard's library "rand()" function 
> which is considered not secure.
> It is recommended to change the library used to generate the random variance 
> to a secure one and also to make sure that for the time in seconds it does 
> not return the same variance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (GEODE-10371) C++ Native client: Improve dispersion on connections expiration

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez resolved GEODE-10371.
---
Resolution: Fixed

> C++ Native client: Improve dispersion on connections expiration
> ---
>
> Key: GEODE-10371
> URL: https://issues.apache.org/jira/browse/GEODE-10371
> Project: Geode
>  Issue Type: Improvement
>  Components: native client
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> The dispersion on connections expirations in the C++ native client works in 
> such a way that it adds a dispersion (variance) between -9% and 9% over the 
> time for a connection to expire due to load-conditioning so that, in the 
> event of having many connections being created at the same, they do not 
> expire at the right exact time.
> Nevertheless, the current implementation has two problems:
> - The randomness of the variance depends on the current time in seconds. As a 
> result, for connections created in the same second, the variance will be the 
> same and, therefore, the expiration time too.
> - The randomness is created using the C standard's library "rand()" function 
> which is considered not secure.
> It is recommended to change the library used to generate the random variance 
> to a secure one and also to make sure that for the time in seconds it does 
> not return the same variance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (GEODE-10403) Distributed deadlock when stopping gateway sender

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez resolved GEODE-10403.
---
Resolution: Fixed

> Distributed deadlock when stopping gateway sender
> -
>
> Key: GEODE-10403
> URL: https://issues.apache.org/jira/browse/GEODE-10403
> Project: Geode
>  Issue Type: Bug
>  Components: wan
>Affects Versions: 1.12.9, 1.13.8, 1.14.4, 1.15.0
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: needsTriage, pull-request-available
> Fix For: 1.16.0
>
>
> A distributed deadlock has been found during some tests of a Geode system 
> with WAN replication when stopping the gateway sender while sending a fair 
> amount of operations to the servers.
> The distributed deadlock manifests in the gateway sender stop command hanging 
> forever and by all normal Geode operations from clients (gets, puts,...) not 
> being responded.
> The situation is provoked by the Gateway sender stop command that first takes 
> the lifecycle lock and then, at a given point, tries to retrieve the size of 
> the gateway sender. This operation, that requires communication with the 
> other peers never finishes, probably because the response from one of the 
> peers is never received.
> Another thread is blocked when trying to acquire the lifecycle lock in 
> AbstractGatewaySender.distribute().
> Finally many threads handling Geode operations (get, put...) get blocked in 
> the DistributedCacheOperation._distribute() call waiting for a response from 
> another peer.
> Thread dump section from blocked gateway sender stop command in call to get 
> size of queue:
> {{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 
> daemon prio=10 os_prio=0 cpu=45.55ms elapsed=4152.76s tid=0x7f92bc1c2000 
> nid=0x2154 waiting on condition  [0x7f9179cd2000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x00031ca2be50> (a 
> java.util.concurrent.CountDownLatch$Sync)}}
> {{        at 
> java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{        at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{        at 
> java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{        at 
> org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{        at 
> org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{        at 
> org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)}}
> {{        at 
> org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)}}
> {{        at 
> org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)}}
> {{        at 
> org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)}}
> {{        at 
> org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)}}
> {{        at 
> org.apache.geode.in

[jira] [Resolved] (GEODE-10340) Add new DiskStoreMXBean JMX metrics

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez resolved GEODE-10340.
---
Resolution: Fixed

> Add new DiskStoreMXBean JMX metrics
> ---
>
> Key: GEODE-10340
> URL: https://issues.apache.org/jira/browse/GEODE-10340
> Project: Geode
>  Issue Type: New Feature
>  Components: persistence, statistics
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> In order to be able to visualize the progress of oplog recovery at server 
> startup it would be nice that the recoveredEntryCreates, 
> recoveredEntryUpdates and recoveredEntryDestroys DiskStore stats are 
> published via JMX.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (GEODE-10323) OffHeapStorageJUnitTest testCreateOffHeapStorage fails with AssertionError: expected:<100> but was:<1048576>

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez resolved GEODE-10323.
---
Resolution: Fixed

> OffHeapStorageJUnitTest testCreateOffHeapStorage fails with AssertionError: 
> expected:<100> but was:<1048576>
> 
>
> Key: GEODE-10323
> URL: https://issues.apache.org/jira/browse/GEODE-10323
> Project: Geode
>  Issue Type: Bug
>  Components: offheap
>Affects Versions: 1.16.0
>Reporter: Kirk Lund
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> [Please see 1st comment on this ticket below the description for 
> recommendation on how to fix.]
> {{OffHeapStorageJUnitTest testCreateOffHeapStorage}} started failing 
> intermittently due to commit a350ed2 which went in as 
> "[GEODE-10087|https://issues.apache.org/jira/browse/GEODE-10087]: Enhance 
> off-heap fragmentation visibility. 
> ([#7407|https://github.com/apache/geode/pull/7407/commits/1640effc5c760afa8d9ec4c2743950bb1de0ef8f])".
> The failure stack looks like:
> {noformat}
> OffHeapStorageJUnitTest > testCreateOffHeapStorage FAILED
> java.lang.AssertionError: expected:<100> but was:<1048576>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.geode.internal.offheap.OffHeapStorageJUnitTest.testCreateOffHeapStorage(OffHeapStorageJUnitTest.java:220)
> {noformat}
> The cause is in {{MemoryAllocatorImpl}}. A new scheduled executor 
> {{updateNonRealTimeStatsExecutor}} was added near the bottom of the 
> constructor which immediately gets scheduled to invoke 
> {{freeList::updateNonRealTimeStats}} repeatedly at the specified frequency of 
> {{updateOffHeapStatsFrequencyMs}} whenever an instance of 
> {{MemoryAllocatorImpl}} is created.
> {{freeList::updateNonRealTimeStats}} then updates {{setLargestFragment}} and 
> {{setFreedChunks}}.
> Scheduling or starting any sort of threads from a constructor is considered a 
> bad practice and should only be done via some sort of activation method such 
> as {{start()}} which can be invoked from the static factory method for 
> {{MemoryAllocatorImpl}}. The unit tests would then continue to construct 
> instances via a constructor that does NOT invoke {{start()}}.
> {{OffHeapStorageJUnitTest testCreateOffHeapStorage}}, and possibly other 
> tests, is written with the assumption that no other thread or component can 
> or will be updating the {{OffHeapStats}}. Hence it is then safe for the test 
> to setLargestFragment and then assert that it has the value passed in:
> {noformat}
> 219:  stats.setLargestFragment(100);
> 220:  assertEquals(100, stats.getLargestFragment());
> {noformat}
> Line 220 is the source of the assertion failure because 
> {{updateNonRealTimeStatsExecutor}} is racing with the JUnit thread and 
> changes the value of {{setLargestFragment}} immediately after the test sets 
> the value to 100, but before the test invokes the assertion.
> Looking back at the [PR test 
> results|https://cio.hdb.gemfire-ci.info/commits/pr/7407/junit?days=90] we can 
> see that stress-new-test failed 5 times with this exact failure before being 
> merged to develop:
> * 
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-7407/test-results/repeatTest/1646140652/
> * 
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-7407/test-results/repeatTest/1646154134/
>  (x2)
> * 
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-7407/test-results/repeatTest/1646156997/
> * 
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-7407/test-results/repeatTest/1646170346/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (GEODE-10300) C++ Native client messages coming from the locator cannot be longer than 3000 bytes

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez resolved GEODE-10300.
---
Resolution: Fixed

> C++ Native client messages coming from the locator cannot be longer than 3000 
> bytes
> ---
>
> Key: GEODE-10300
> URL: https://issues.apache.org/jira/browse/GEODE-10300
> Project: Geode
>  Issue Type: Bug
>  Components: native client
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> If a locator sends a response to the C++ native client that is longer than 
> 3000 bytes the C++ native client library will only read the first 3000 bytes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (GEODE-10417) Fix NullPointerException when getting events from the gw sender queue to complete transactions

2022-09-06 Thread Alberto Gomez (Jira)
Alberto Gomez created GEODE-10417:
-

 Summary: Fix NullPointerException when getting events from the gw 
sender queue to complete transactions
 Key: GEODE-10417
 URL: https://issues.apache.org/jira/browse/GEODE-10417
 Project: Geode
  Issue Type: Bug
  Components: wan
Affects Versions: 1.15.0, 1.14.4, 1.13.8
Reporter: Alberto Gomez


When the WAN group-transaction-events feature is enabled, it is possible to get 
a NullPointerException when retrieving events from the queue to complete a 
transaction if the event in the queue is null.

If this situation is reached then the gateway sender dispatcher will not 
dispatch queue events anymore and therefore the WAN replication will not 
progress.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (GEODE-10417) Fix NullPointerException when getting events from the gw sender queue to complete transactions

2022-09-06 Thread Alexander Murmann (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Murmann updated GEODE-10417:
--
Labels: needsTriage  (was: )

> Fix NullPointerException when getting events from the gw sender queue to 
> complete transactions
> --
>
> Key: GEODE-10417
> URL: https://issues.apache.org/jira/browse/GEODE-10417
> Project: Geode
>  Issue Type: Bug
>  Components: wan
>Affects Versions: 1.13.8, 1.14.4, 1.15.0
>Reporter: Alberto Gomez
>Priority: Major
>  Labels: needsTriage
>
> When the WAN group-transaction-events feature is enabled, it is possible to 
> get a NullPointerException when retrieving events from the queue to complete 
> a transaction if the event in the queue is null.
> If this situation is reached then the gateway sender dispatcher will not 
> dispatch queue events anymore and therefore the WAN replication will not 
> progress.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (GEODE-10417) Fix NullPointerException when getting events from the gw sender queue to complete transactions

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez reassigned GEODE-10417:
-

Assignee: Alberto Gomez

> Fix NullPointerException when getting events from the gw sender queue to 
> complete transactions
> --
>
> Key: GEODE-10417
> URL: https://issues.apache.org/jira/browse/GEODE-10417
> Project: Geode
>  Issue Type: Bug
>  Components: wan
>Affects Versions: 1.13.8, 1.14.4, 1.15.0
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: needsTriage
>
> When the WAN group-transaction-events feature is enabled, it is possible to 
> get a NullPointerException when retrieving events from the queue to complete 
> a transaction if the event in the queue is null.
> If this situation is reached then the gateway sender dispatcher will not 
> dispatch queue events anymore and therefore the WAN replication will not 
> progress.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (GEODE-10417) Fix NullPointerException when getting events from the gw sender queue to complete transactions

2022-09-06 Thread Alberto Gomez (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alberto Gomez updated GEODE-10417:
--
Description: 
When the WAN group-transaction-events feature is enabled in a parallel gateway 
sender, it is possible to get a NullPointerException when retrieving events 
from the queue to complete a transaction if the event in the queue is null.

If this situation is reached then the gateway sender dispatcher will not 
dispatch queue events anymore and therefore the WAN replication will not 
progress.

  was:
When the WAN group-transaction-events feature is enabled, it is possible to get 
a NullPointerException when retrieving events from the queue to complete a 
transaction if the event in the queue is null.

If this situation is reached then the gateway sender dispatcher will not 
dispatch queue events anymore and therefore the WAN replication will not 
progress.


> Fix NullPointerException when getting events from the gw sender queue to 
> complete transactions
> --
>
> Key: GEODE-10417
> URL: https://issues.apache.org/jira/browse/GEODE-10417
> Project: Geode
>  Issue Type: Bug
>  Components: wan
>Affects Versions: 1.13.8, 1.14.4, 1.15.0
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: needsTriage
>
> When the WAN group-transaction-events feature is enabled in a parallel 
> gateway sender, it is possible to get a NullPointerException when retrieving 
> events from the queue to complete a transaction if the event in the queue is 
> null.
> If this situation is reached then the gateway sender dispatcher will not 
> dispatch queue events anymore and therefore the WAN replication will not 
> progress.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (GEODE-10417) Fix NullPointerException when getting events from the gw sender queue to complete transactions

2022-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated GEODE-10417:
---
Labels: needsTriage pull-request-available  (was: needsTriage)

> Fix NullPointerException when getting events from the gw sender queue to 
> complete transactions
> --
>
> Key: GEODE-10417
> URL: https://issues.apache.org/jira/browse/GEODE-10417
> Project: Geode
>  Issue Type: Bug
>  Components: wan
>Affects Versions: 1.13.8, 1.14.4, 1.15.0
>Reporter: Alberto Gomez
>Assignee: Alberto Gomez
>Priority: Major
>  Labels: needsTriage, pull-request-available
>
> When the WAN group-transaction-events feature is enabled in a parallel 
> gateway sender, it is possible to get a NullPointerException when retrieving 
> events from the queue to complete a transaction if the event in the queue is 
> null.
> If this situation is reached then the gateway sender dispatcher will not 
> dispatch queue events anymore and therefore the WAN replication will not 
> progress.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)