[jira] [Commented] (GEODE-10395) TXLockIdImpl memory leak after CommitConflictException from another node

2022-09-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607318#comment-17607318
 ] 

ASF subversion and git services commented on GEODE-10395:
-

Commit 4cb75ae4848250606db2f4b14300601755586192 in geode's branch 
refs/heads/develop from Mario Kevo
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=4cb75ae484 ]

GEODE-10395 remove locks from List if dlock.acquireTryLocks return false (#7846)



> TXLockIdImpl memory leak after CommitConflictException from another node
> 
>
> Key: GEODE-10395
> URL: https://issues.apache.org/jira/browse/GEODE-10395
> Project: Geode
>  Issue Type: Bug
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Eugene Nedzvetsky
>Assignee: Mario Kevo
>Priority: Major
>  Labels: pull-request-available
>
> org.apache.geode.internal.cache.locks.TXLockServiceImpl#txLock:120 adds 
> TXLockIdImpl  objects to TXLockServiceImpl#txLockIdList. 
> {code:java}
> synchronized (txLockIdList) {
> txLockId = new TXLockIdImpl(dlock.getDistributionManager().getId());
> txLockIdList.add(txLockId);
>   }
> {code}
> These objects will not be removed from this list if dlock.acquireTryLocks 
> returned false.
> {code:java}
>   gotLocks = dlock.acquireTryLocks(batch, TIMEOUT_MILLIS, LEASE_MILLIS, 
> keyIfFail);
>   if (gotLocks) { // ...otherwise race can occur between tryLocks and 
> readLock
> acquireRecoveryReadLock();
>   } else if (keyIfFail[0] != null) {
> throw new CommitConflictException(
> String.format("Concurrent transaction commit detected %s",
> keyIfFail[0]));
>   } else {
> throw new CommitConflictException(
> String.format("Failed to request try locks from grantor: %s",
> dlock.getLockGrantorId()));
>   }
> {code}
> It throws CommitConflictException and after that system doesn't have a 
> txLockId reference and this txLockId will be never removed from this list.
> It produces critical performance degradation. txLockIdList has a few hundred 
> thousand txLocks after a few weeks and TXLockServiceImpl#release iterates 
> this list 2 times on every tx commit and the same time "synchronized 
> (txLockIdList)" locks other threads.
> TXLockIdImpl#equals works really slow because it checks bunch of variables in 
> memberId.equals(that.memberId).
> {code:java}
> public void release(TXLockId txLockId) {
> synchronized (txLockIdList) {
>   if (!txLockIdList.contains(txLockId)) {
> // TXLockService.destroyServices can be invoked in cache.close().
> // Other P2P threads could process message such as TXCommitMessage 
> afterwards,
> // and invoke TXLockService.createDTLS(). It could create a new 
> TXLockService
> // which will have a new empty list (txLockIdList) and it will not
> // contain the originally added txLockId
> throw new IllegalArgumentException(
> String.format("Invalid txLockId not found: %s",
> txLockId));
>   }
>   dlock.releaseTryLocks(txLockId, () -> {
> return recovering;
>   });
>   txLockIdList.remove(txLockId);
>   releaseRecoveryReadLock();
> }
>   }
> {code}
> I think TXLockServiceImpl#txLock should remove this txLockId from 
> TXLockServiceImpl#txLockIdList in case of CommitConflictException:
> {code:java}
>  if (gotLocks) { // ...otherwise race can occur between tryLocks and readLock
> acquireRecoveryReadLock();
> } else if (keyIfFail[0] != null) {
> synchronized (this.txLockIdList) {
> this.txLockIdList.remove(txLockId);
> }
> throw new CommitConflictException(
> String.format("Concurrent transaction commit detected 
> %s",
> keyIfFail[0]));
> } else {
> synchronized (this.txLockIdList) {
> this.txLockIdList.remove(txLockId);
> }
> throw new CommitConflictException(
> String.format("Failed to request try locks from 
> grantor: %s",
> this.dlock.getLockGrantorId()));
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (GEODE-10410) Rebalance Guard Prevent Lost Bucket Recovery

2022-09-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/GEODE-10410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607337#comment-17607337
 ] 

ASF subversion and git services commented on GEODE-10410:
-

Commit 67ebd727bef5c613bfe2aaf4258a5472ac433978 in geode's branch 
refs/heads/develop from WeijieEST
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=67ebd727be ]

GEODE-10410: Fix bucket lost during rebalance (#7857)

* GEODE-10410: Fix bucket lost during rebalance

* improve test case name

* improve test case comments and test case names

> Rebalance Guard Prevent Lost Bucket Recovery
> 
>
> Key: GEODE-10410
> URL: https://issues.apache.org/jira/browse/GEODE-10410
> Project: Geode
>  Issue Type: Bug
>Reporter: Weijie Xu
>Assignee: Weijie Xu
>Priority: Major
>  Labels: needsTriage, pull-request-available
> Attachments: server2.log, test.tar.gz
>
>
> Following steps reproduce the issue:
> Run the start.gfsh in the attached example, which configures a geode system 
> with a partitioned region and a gateway sender. So there are two regions, the 
> manually created region, and the queue region.
> Then run the example code, which will source ~400M data and 5 times amount of 
> events into the system. All data are sourced into the system, no bucket lost, 
> and no out of memory.
> Then stop one of the server, and revoke the disk file of the server.
> Then start the server, which will trigger a bucket recovery. After that, 
> there will be part of secondary bucket lost.
> gfsh>show metrics --region=/example-region
>           | numBucketsWithoutRedundancy  | 63
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (GEODE-10410) Rebalance Guard Prevent Lost Bucket Recovery

2022-09-20 Thread Weijie Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weijie Xu resolved GEODE-10410.
---
Resolution: Fixed

> Rebalance Guard Prevent Lost Bucket Recovery
> 
>
> Key: GEODE-10410
> URL: https://issues.apache.org/jira/browse/GEODE-10410
> Project: Geode
>  Issue Type: Bug
>Reporter: Weijie Xu
>Assignee: Weijie Xu
>Priority: Major
>  Labels: needsTriage, pull-request-available
> Attachments: server2.log, test.tar.gz
>
>
> Following steps reproduce the issue:
> Run the start.gfsh in the attached example, which configures a geode system 
> with a partitioned region and a gateway sender. So there are two regions, the 
> manually created region, and the queue region.
> Then run the example code, which will source ~400M data and 5 times amount of 
> events into the system. All data are sourced into the system, no bucket lost, 
> and no out of memory.
> Then stop one of the server, and revoke the disk file of the server.
> Then start the server, which will trigger a bucket recovery. After that, 
> there will be part of secondary bucket lost.
> gfsh>show metrics --region=/example-region
>           | numBucketsWithoutRedundancy  | 63
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (GEODE-10423) Document the system property “ON_DISCONNECT_CLEAR_PDXTYPEIDS“

2022-09-20 Thread Tim Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Zhang updated GEODE-10423:
--
Priority: Minor  (was: Major)

> Document the system property “ON_DISCONNECT_CLEAR_PDXTYPEIDS“
> -
>
> Key: GEODE-10423
> URL: https://issues.apache.org/jira/browse/GEODE-10423
> Project: Geode
>  Issue Type: Improvement
>  Components: docs
>Reporter: Tim Zhang
>Priority: Minor
>  Labels: pull-request-available
>
> Document the java system property “ON_DISCONNECT_CLEAR_PDXTYPEIDS“, this 
> property is used by Java client. add instructions for using this property.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (GEODE-10395) TXLockIdImpl memory leak after CommitConflictException from another node

2022-09-20 Thread Mario Kevo (Jira)


 [ 
https://issues.apache.org/jira/browse/GEODE-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Kevo resolved GEODE-10395.

Fix Version/s: 1.16.0
   Resolution: Fixed

> TXLockIdImpl memory leak after CommitConflictException from another node
> 
>
> Key: GEODE-10395
> URL: https://issues.apache.org/jira/browse/GEODE-10395
> Project: Geode
>  Issue Type: Bug
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Eugene Nedzvetsky
>Assignee: Mario Kevo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
>
> org.apache.geode.internal.cache.locks.TXLockServiceImpl#txLock:120 adds 
> TXLockIdImpl  objects to TXLockServiceImpl#txLockIdList. 
> {code:java}
> synchronized (txLockIdList) {
> txLockId = new TXLockIdImpl(dlock.getDistributionManager().getId());
> txLockIdList.add(txLockId);
>   }
> {code}
> These objects will not be removed from this list if dlock.acquireTryLocks 
> returned false.
> {code:java}
>   gotLocks = dlock.acquireTryLocks(batch, TIMEOUT_MILLIS, LEASE_MILLIS, 
> keyIfFail);
>   if (gotLocks) { // ...otherwise race can occur between tryLocks and 
> readLock
> acquireRecoveryReadLock();
>   } else if (keyIfFail[0] != null) {
> throw new CommitConflictException(
> String.format("Concurrent transaction commit detected %s",
> keyIfFail[0]));
>   } else {
> throw new CommitConflictException(
> String.format("Failed to request try locks from grantor: %s",
> dlock.getLockGrantorId()));
>   }
> {code}
> It throws CommitConflictException and after that system doesn't have a 
> txLockId reference and this txLockId will be never removed from this list.
> It produces critical performance degradation. txLockIdList has a few hundred 
> thousand txLocks after a few weeks and TXLockServiceImpl#release iterates 
> this list 2 times on every tx commit and the same time "synchronized 
> (txLockIdList)" locks other threads.
> TXLockIdImpl#equals works really slow because it checks bunch of variables in 
> memberId.equals(that.memberId).
> {code:java}
> public void release(TXLockId txLockId) {
> synchronized (txLockIdList) {
>   if (!txLockIdList.contains(txLockId)) {
> // TXLockService.destroyServices can be invoked in cache.close().
> // Other P2P threads could process message such as TXCommitMessage 
> afterwards,
> // and invoke TXLockService.createDTLS(). It could create a new 
> TXLockService
> // which will have a new empty list (txLockIdList) and it will not
> // contain the originally added txLockId
> throw new IllegalArgumentException(
> String.format("Invalid txLockId not found: %s",
> txLockId));
>   }
>   dlock.releaseTryLocks(txLockId, () -> {
> return recovering;
>   });
>   txLockIdList.remove(txLockId);
>   releaseRecoveryReadLock();
> }
>   }
> {code}
> I think TXLockServiceImpl#txLock should remove this txLockId from 
> TXLockServiceImpl#txLockIdList in case of CommitConflictException:
> {code:java}
>  if (gotLocks) { // ...otherwise race can occur between tryLocks and readLock
> acquireRecoveryReadLock();
> } else if (keyIfFail[0] != null) {
> synchronized (this.txLockIdList) {
> this.txLockIdList.remove(txLockId);
> }
> throw new CommitConflictException(
> String.format("Concurrent transaction commit detected 
> %s",
> keyIfFail[0]));
> } else {
> synchronized (this.txLockIdList) {
> this.txLockIdList.remove(txLockId);
> }
> throw new CommitConflictException(
> String.format("Failed to request try locks from 
> grantor: %s",
> this.dlock.getLockGrantorId()));
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)