[jira] [Commented] (GEODE-10395) TXLockIdImpl memory leak after CommitConflictException from another node

[ https://issues.apache.org/jira/browse/GEODE-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607318#comment-17607318 ]

ASF subversion and git services commented on GEODE-10395:
---------------------------------------------------------

Commit 4cb75ae4848250606db2f4b14300601755586192 in geode's branch refs/heads/develop from Mario Kevo
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=4cb75ae484 ]

GEODE-10395 remove locks from List if dlock.acquireTryLocks return false (#7846)

> TXLockIdImpl memory leak after CommitConflictException from another node
> ------------------------------------------------------------------------
>
>                 Key: GEODE-10395
>                 URL: https://issues.apache.org/jira/browse/GEODE-10395
>             Project: Geode
>          Issue Type: Bug
>    Affects Versions: 1.14.0, 1.15.0
>            Reporter: Eugene Nedzvetsky
>            Assignee: Mario Kevo
>            Priority: Major
>              Labels: pull-request-available
>
> org.apache.geode.internal.cache.locks.TXLockServiceImpl#txLock:120 adds
> TXLockIdImpl objects to TXLockServiceImpl#txLockIdList.
> {code:java}
> synchronized (txLockIdList) {
>   txLockId = new TXLockIdImpl(dlock.getDistributionManager().getId());
>   txLockIdList.add(txLockId);
> }
> {code}
> These objects are not removed from the list if dlock.acquireTryLocks
> returns false.
> {code:java}
> gotLocks = dlock.acquireTryLocks(batch, TIMEOUT_MILLIS, LEASE_MILLIS, keyIfFail);
> if (gotLocks) { // ...otherwise race can occur between tryLocks and readLock
>   acquireRecoveryReadLock();
> } else if (keyIfFail[0] != null) {
>   throw new CommitConflictException(
>       String.format("Concurrent transaction commit detected %s", keyIfFail[0]));
> } else {
>   throw new CommitConflictException(
>       String.format("Failed to request try locks from grantor: %s",
>           dlock.getLockGrantorId()));
> }
> {code}
> The method throws CommitConflictException, after which the system no longer
> holds a reference to the txLockId, so the txLockId is never removed from the
> list.
> This causes severe performance degradation. After a few weeks txLockIdList
> holds a few hundred thousand txLocks, and TXLockServiceImpl#release iterates
> over the list twice on every tx commit while "synchronized (txLockIdList)"
> blocks other threads.
> TXLockIdImpl#equals is slow because it compares many fields in
> memberId.equals(that.memberId).
> {code:java}
> public void release(TXLockId txLockId) {
>   synchronized (txLockIdList) {
>     if (!txLockIdList.contains(txLockId)) {
>       // TXLockService.destroyServices can be invoked in cache.close().
>       // Other P2P threads could process message such as TXCommitMessage afterwards,
>       // and invoke TXLockService.createDTLS(). It could create a new TXLockService
>       // which will have a new empty list (txLockIdList) and it will not
>       // contain the originally added txLockId
>       throw new IllegalArgumentException(
>           String.format("Invalid txLockId not found: %s", txLockId));
>     }
>     dlock.releaseTryLocks(txLockId, () -> {
>       return recovering;
>     });
>     txLockIdList.remove(txLockId);
>     releaseRecoveryReadLock();
>   }
> }
> {code}
> I think TXLockServiceImpl#txLock should remove this txLockId from
> TXLockServiceImpl#txLockIdList when it throws CommitConflictException:
> {code:java}
> if (gotLocks) { // ...otherwise race can occur between tryLocks and readLock
>   acquireRecoveryReadLock();
> } else if (keyIfFail[0] != null) {
>   synchronized (this.txLockIdList) {
>     this.txLockIdList.remove(txLockId);
>   }
>   throw new CommitConflictException(
>       String.format("Concurrent transaction commit detected %s", keyIfFail[0]));
> } else {
>   synchronized (this.txLockIdList) {
>     this.txLockIdList.remove(txLockId);
>   }
>   throw new CommitConflictException(
>       String.format("Failed to request try locks from grantor: %s",
>           this.dlock.getLockGrantorId()));
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

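
The leak and the proposed fix in GEODE-10395 can be illustrated with a small standalone model. This is only a sketch under stated assumptions: TxLockLeakSketch, its fields, and the simulated IllegalStateException are hypothetical stand-ins, not Geode's TXLockServiceImpl or CommitConflictException. It registers an id before the lock attempt, and shows that the list keeps growing unless the failure path removes the id again.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a lock service; NOT Geode's TXLockServiceImpl. */
final class TxLockLeakSketch {

  /** Plays the role the report describes for txLockIdList. */
  private final List<Object> lockIds = new ArrayList<>();

  /** When true, the failure path removes the id again (the proposed fix). */
  private final boolean removeOnFailure;

  TxLockLeakSketch(boolean removeOnFailure) {
    this.removeOnFailure = removeOnFailure;
  }

  /** Registers an id first, then simulates the outcome of the lock attempt. */
  Object txLock(boolean gotLocks) {
    Object lockId = new Object();
    synchronized (lockIds) {
      lockIds.add(lockId);
    }
    if (!gotLocks) {
      if (removeOnFailure) {
        synchronized (lockIds) {
          lockIds.remove(lockId);
        }
      }
      // The caller never receives lockId, so it cannot release it later.
      throw new IllegalStateException("simulated commit conflict");
    }
    return lockId;
  }

  int trackedIds() {
    synchronized (lockIds) {
      return lockIds.size();
    }
  }

  /** Runs the given number of failed lock attempts and reports what is left. */
  static int leakedEntries(boolean removeOnFailure, int failedCommits) {
    TxLockLeakSketch service = new TxLockLeakSketch(removeOnFailure);
    for (int i = 0; i < failedCommits; i++) {
      try {
        service.txLock(false);
      } catch (IllegalStateException expected) {
        // A conflict is the normal outcome in this simulation.
      }
    }
    return service.trackedIds();
  }

  public static void main(String[] args) {
    System.out.println("without cleanup: " + leakedEntries(false, 1000)); // prints 1000
    System.out.println("with cleanup:    " + leakedEntries(true, 1000)); // prints 0
  }
}
```

The sketch removes the id in the failing branch, mirroring the shape of the change proposed in the report; a try/finally around the lock attempt would be another way to get the same guarantee.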
[jira] [Commented] (GEODE-10410) Rebalance Guard Prevent Lost Bucket Recovery

[ https://issues.apache.org/jira/browse/GEODE-10410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607337#comment-17607337 ]

ASF subversion and git services commented on GEODE-10410:
---------------------------------------------------------

Commit 67ebd727bef5c613bfe2aaf4258a5472ac433978 in geode's branch refs/heads/develop from WeijieEST
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=67ebd727be ]

GEODE-10410: Fix bucket lost during rebalance (#7857)

* GEODE-10410: Fix bucket lost during rebalance
* improve test case name
* improve test case comments and test case names

> Rebalance Guard Prevent Lost Bucket Recovery
> --------------------------------------------
>
>                 Key: GEODE-10410
>                 URL: https://issues.apache.org/jira/browse/GEODE-10410
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Weijie Xu
>            Assignee: Weijie Xu
>            Priority: Major
>              Labels: needsTriage, pull-request-available
>         Attachments: server2.log, test.tar.gz
>
> The following steps reproduce the issue:
> Run the start.gfsh in the attached example, which configures a Geode system
> with a partitioned region and a gateway sender. This gives two regions: the
> manually created region and the queue region.
> Then run the example code, which loads ~400M of data and 5 times that amount
> of events into the system. All data is loaded, no buckets are lost, and the
> system does not run out of memory.
> Then stop one of the servers and revoke the disk file of that server.
> Then start the server, which triggers a bucket recovery. After that, some of
> the secondary buckets are lost.
> gfsh>show metrics --region=/example-region
> | numBucketsWithoutRedundancy | 63

[jira] [Resolved] (GEODE-10410) Rebalance Guard Prevent Lost Bucket Recovery

[ https://issues.apache.org/jira/browse/GEODE-10410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weijie Xu resolved GEODE-10410.
------------------------------
    Resolution: Fixed

> Rebalance Guard Prevent Lost Bucket Recovery
> --------------------------------------------
>
>                 Key: GEODE-10410
>                 URL: https://issues.apache.org/jira/browse/GEODE-10410
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Weijie Xu
>            Assignee: Weijie Xu
>            Priority: Major
>              Labels: needsTriage, pull-request-available
>         Attachments: server2.log, test.tar.gz
>
> The following steps reproduce the issue:
> Run the start.gfsh in the attached example, which configures a Geode system
> with a partitioned region and a gateway sender. This gives two regions: the
> manually created region and the queue region.
> Then run the example code, which loads ~400M of data and 5 times that amount
> of events into the system. All data is loaded, no buckets are lost, and the
> system does not run out of memory.
> Then stop one of the servers and revoke the disk file of that server.
> Then start the server, which triggers a bucket recovery. After that, some of
> the secondary buckets are lost.
> gfsh>show metrics --region=/example-region
> | numBucketsWithoutRedundancy | 63

[jira] [Updated] (GEODE-10423) Document the system property “ON_DISCONNECT_CLEAR_PDXTYPEIDS“

[ https://issues.apache.org/jira/browse/GEODE-10423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Zhang updated GEODE-10423:
------------------------------
    Priority: Minor  (was: Major)

> Document the system property “ON_DISCONNECT_CLEAR_PDXTYPEIDS“
> -------------------------------------------------------------
>
>                 Key: GEODE-10423
>                 URL: https://issues.apache.org/jira/browse/GEODE-10423
>             Project: Geode
>          Issue Type: Improvement
>          Components: docs
>            Reporter: Tim Zhang
>            Priority: Minor
>              Labels: pull-request-available
>
> Document the Java system property "ON_DISCONNECT_CLEAR_PDXTYPEIDS", which is
> used by the Java client, and add instructions for using this property.

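
As a rough illustration of what "using" a boolean Java system property generally looks like, the sketch below sets and reads one with the standard JDK calls. Several things here are assumptions and not taken from the ticket: the class ClearPdxTypeIdsPropertySketch is hypothetical, the ticket does not give the exact property key (Geode properties often carry a prefix such as "gemfire.", which is not applied here), and how the Geode client actually reads the value is not shown in this thread.

```java
/** Hypothetical helper; the exact key and how Geode reads it are assumptions. */
final class ClearPdxTypeIdsPropertySketch {

  // Name as written in the ticket; the real key may need a prefix (assumption).
  static final String PROPERTY = "ON_DISCONNECT_CLEAR_PDXTYPEIDS";

  /** Boolean.getBoolean returns true only if the property equals "true" (ignoring case). */
  static boolean isEnabled() {
    return Boolean.getBoolean(PROPERTY);
  }

  public static void main(String[] args) {
    System.out.println("before: " + isEnabled());
    // Equivalent to passing -DON_DISCONNECT_CLEAR_PDXTYPEIDS=true on the java command line.
    System.setProperty(PROPERTY, "true");
    System.out.println("after:  " + isEnabled());
  }
}
```

Setting the property on the command line is usually preferable to System.setProperty, because code that reads a system property once at class-initialization time will not see a value set later.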
[jira] [Resolved] (GEODE-10395) TXLockIdImpl memory leak after CommitConflictException from another node

[ https://issues.apache.org/jira/browse/GEODE-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mario Kevo resolved GEODE-10395.
-------------------------------
    Fix Version/s: 1.16.0
       Resolution: Fixed

> TXLockIdImpl memory leak after CommitConflictException from another node
> ------------------------------------------------------------------------
>
>                 Key: GEODE-10395
>                 URL: https://issues.apache.org/jira/browse/GEODE-10395
>             Project: Geode
>          Issue Type: Bug
>    Affects Versions: 1.14.0, 1.15.0
>            Reporter: Eugene Nedzvetsky
>            Assignee: Mario Kevo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.16.0
>
> org.apache.geode.internal.cache.locks.TXLockServiceImpl#txLock:120 adds
> TXLockIdImpl objects to TXLockServiceImpl#txLockIdList.
> {code:java}
> synchronized (txLockIdList) {
>   txLockId = new TXLockIdImpl(dlock.getDistributionManager().getId());
>   txLockIdList.add(txLockId);
> }
> {code}
> These objects are not removed from the list if dlock.acquireTryLocks
> returns false.
> {code:java}
> gotLocks = dlock.acquireTryLocks(batch, TIMEOUT_MILLIS, LEASE_MILLIS, keyIfFail);
> if (gotLocks) { // ...otherwise race can occur between tryLocks and readLock
>   acquireRecoveryReadLock();
> } else if (keyIfFail[0] != null) {
>   throw new CommitConflictException(
>       String.format("Concurrent transaction commit detected %s", keyIfFail[0]));
> } else {
>   throw new CommitConflictException(
>       String.format("Failed to request try locks from grantor: %s",
>           dlock.getLockGrantorId()));
> }
> {code}
> The method throws CommitConflictException, after which the system no longer
> holds a reference to the txLockId, so the txLockId is never removed from the
> list.
> This causes severe performance degradation. After a few weeks txLockIdList
> holds a few hundred thousand txLocks, and TXLockServiceImpl#release iterates
> over the list twice on every tx commit while "synchronized (txLockIdList)"
> blocks other threads.
> TXLockIdImpl#equals is slow because it compares many fields in
> memberId.equals(that.memberId).
> {code:java}
> public void release(TXLockId txLockId) {
>   synchronized (txLockIdList) {
>     if (!txLockIdList.contains(txLockId)) {
>       // TXLockService.destroyServices can be invoked in cache.close().
>       // Other P2P threads could process message such as TXCommitMessage afterwards,
>       // and invoke TXLockService.createDTLS(). It could create a new TXLockService
>       // which will have a new empty list (txLockIdList) and it will not
>       // contain the originally added txLockId
>       throw new IllegalArgumentException(
>           String.format("Invalid txLockId not found: %s", txLockId));
>     }
>     dlock.releaseTryLocks(txLockId, () -> {
>       return recovering;
>     });
>     txLockIdList.remove(txLockId);
>     releaseRecoveryReadLock();
>   }
> }
> {code}
> I think TXLockServiceImpl#txLock should remove this txLockId from
> TXLockServiceImpl#txLockIdList when it throws CommitConflictException:
> {code:java}
> if (gotLocks) { // ...otherwise race can occur between tryLocks and readLock
>   acquireRecoveryReadLock();
> } else if (keyIfFail[0] != null) {
>   synchronized (this.txLockIdList) {
>     this.txLockIdList.remove(txLockId);
>   }
>   throw new CommitConflictException(
>       String.format("Concurrent transaction commit detected %s", keyIfFail[0]));
> } else {
>   synchronized (this.txLockIdList) {
>     this.txLockIdList.remove(txLockId);
>   }
>   throw new CommitConflictException(
>       String.format("Failed to request try locks from grantor: %s",
>           this.dlock.getLockGrantorId()));
> }
> {code}

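
The claim that release walks the list twice per commit can be made concrete by counting equals calls in a small model. This is a sketch under assumptions: ReleaseScanCostSketch and LockId are hypothetical, the list is assumed to be an ArrayList (the report does not say which List implementation Geode uses), and the real TXLockIdImpl#equals does more work per call than the integer comparison used here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the contains-then-remove pattern in release(). */
final class ReleaseScanCostSketch {

  /** Counts every equals call made while the list is searched. */
  static long equalsCalls = 0;

  /** Minimal id type; a stand-in for TXLockIdImpl, not the real class. */
  static final class LockId {
    private final int id;

    LockId(int id) {
      this.id = id;
    }

    @Override
    public boolean equals(Object other) {
      equalsCalls++;
      return other instanceof LockId && ((LockId) other).id == id;
    }

    @Override
    public int hashCode() {
      return Integer.hashCode(id);
    }
  }

  /** Returns how many equals calls one release costs with leakedIds stale entries. */
  static long equalsCallsForOneRelease(int leakedIds) {
    List<LockId> lockIds = new ArrayList<>();
    for (int i = 0; i < leakedIds; i++) {
      lockIds.add(new LockId(i)); // stale entries that were never removed
    }
    LockId current = new LockId(leakedIds);
    lockIds.add(current); // the newest id sits behind all leaked ones

    equalsCalls = 0;
    if (!lockIds.contains(current)) { // first linear scan
      throw new IllegalArgumentException("Invalid txLockId not found");
    }
    lockIds.remove(current); // second linear scan
    return equalsCalls;
  }

  public static void main(String[] args) {
    System.out.println("no leak:       " + equalsCallsForOneRelease(0));
    System.out.println("100000 leaked: " + equalsCallsForOneRelease(100_000));
  }
}
```

In this model each release costs two comparisons per stale entry, all performed while holding the list's monitor, which is consistent with the slowdown described in the report once the list has grown to hundreds of thousands of entries.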