[jira] [Assigned] (GEODE-8687) Durable client is continuously re-registering CQs on all servers when event de-serialization fails causing resource exhaustion on servers
[ https://issues.apache.org/jira/browse/GEODE-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakov Varenina reassigned GEODE-8687:

    Assignee: Jakov Varenina

> Durable client is continuously re-registering CQs on all servers when event de-serialization fails causing resource exhaustion on servers
>
> Key: GEODE-8687
> URL: https://issues.apache.org/jira/browse/GEODE-8687
> Project: Geode
> Issue Type: Bug
> Components: client/server
> Affects Versions: 1.13.0
> Reporter: Jakov Varenina
> Assignee: Jakov Varenina
> Priority: Major
> Attachments: deserialzationFault.log
>
> When ReflectionBasedAutoSerializer is not set, or is set incorrectly, the client hits a serialization exception when it receives CQ events. The exception is not logged, which is misleading and makes it hard to find out that ReflectionBasedAutoSerializer is misconfigured. The only visible log is that the client/server subscription connections are closed due to EOF. This is because the client destroys the subscription connections intentionally but does not log the reason (PdxSerializationException) that led to it. Serialization exceptions should be logged at error or warn level.
>
> The client destroys the subscription connection and performs server fail-over whenever a serialization issue occurs. Additionally, when the subscription connection to a particular server fails multiple times, that server is put on the deny list for 10 seconds (this is configurable with {{ping-interval}}). After the 10s expire, the server is removed from the list and becomes available for a subscription connection, which will be destroyed again due to the serialization issue. This repeats indefinitely, and approximately every 10s in this case the client subscribes to each server at least once. Because of the serialization issue, events are not delivered to the client and remain in the subscription queues.
>
> Whenever the connection fails due to a serialization issue and the client is not durable, the subscription queue is closed and the events are lost.
>
> The biggest problem arises when the client is durable, because the subscription queue remains on the server for a configurable period of time (e.g. 300s) waiting for the client to reconnect. When the client fails over to another server, it creates a new subscription queue using an initial image from the old queue, which is currently paused. This means that all events from the old queue are transferred to the new subscription queue hosted by the current primary server. This happens on all servers, and all of them end up with a copy of the queue even if subscription redundancy isn't configured. The problem is that the client periodically (every 10s in this case) establishes a connection to each server, so the configured timeout (e.g. 300s) never expires; it is renewed each time the client registers. This can cause many problems, since memory and disk usage (if overflow is configured on the queue) increase on all servers.
>
> The attached logs cover the problematic case with a durable client:
> vm0 -> locator
> vm1, vm2 -> servers
> vm3 -> durable client with subscription enabled, handling CQ events
> vm4 -> client generating traffic that should trigger the registered CQ

-- This message was sent by Atlassian Jira (v8.3.4#803005)
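The timeout renewal described in GEODE-8687 can be illustrated with a small model. This is only a sketch under assumptions: `DurableQueueModel` and `simulate` are hypothetical names, not Geode code, and the model covers a single server-side queue whose idle timer is reset on every client registration while no events are ever drained.

```cpp
#include <cassert>

// Hypothetical model (not Geode code) of one durable subscription queue.
struct DurableQueueModel {
  int durableTimeoutSec;   // e.g. 300, as in the report
  int idleSec = 0;         // seconds since the client last registered
  long queuedEvents = 0;   // events retained on the server for the client
  bool expired = false;    // true once the server has dropped the queue

  explicit DurableQueueModel(int timeoutSec) : durableTimeoutSec(timeoutSec) {
    assert(timeoutSec > 0);
  }

  // Advance the model by one second. Registration renews the timeout but
  // drains nothing, because the client fails to deserialize the events.
  void tick(int eventsPerSec, bool clientRegisters) {
    if (expired) return;
    queuedEvents += eventsPerSec;
    if (clientRegisters) {
      idleSec = 0;
      return;
    }
    if (++idleSec >= durableTimeoutSec) {
      expired = true;
      queuedEvents = 0;  // the queue is destroyed together with its events
    }
  }
};

// Run the model for horizonSec seconds. The client re-registers every
// reconnectEverySec seconds; 0 means it never comes back.
DurableQueueModel simulate(int timeoutSec, int reconnectEverySec,
                           int horizonSec, int eventsPerSec) {
  DurableQueueModel queue(timeoutSec);
  for (int t = 1; t <= horizonSec; ++t) {
    bool registers = reconnectEverySec > 0 && t % reconnectEverySec == 0;
    queue.tick(eventsPerSec, registers);
  }
  return queue;
}
```

With a 300s timeout and a re-registration every 10s, the model never expires the queue and the retained event count only grows, which matches the resource exhaustion reported above. With no re-registration the queue is dropped after 300s.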
[jira] [Commented] (GEODE-8547) Command "show missing-disk-stores" not working, when all servers are down
[ https://issues.apache.org/jira/browse/GEODE-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226620#comment-17226620 ]

ASF GitHub Bot commented on GEODE-8547:

mivanac merged pull request #5567:
URL: https://github.com/apache/geode/pull/5567

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Command "show missing-disk-stores" not working, when all servers are down
>
> Key: GEODE-8547
> URL: https://issues.apache.org/jira/browse/GEODE-8547
> Project: Geode
> Issue Type: Bug
> Components: gfsh
> Reporter: Mario Ivanac
> Assignee: Mario Ivanac
> Priority: Major
> Labels: needs-review, pull-request-available
>
> If a cluster with 2 locators and 2 servers was shut down ungracefully, it can happen that the locators that are able to start up do not have the most recent data needed to bring up the Cluster Configuration Service.
> If we execute the command "show missing-disk-stores", it will not work, since all servers are down, so we are stuck in this situation.
[jira] [Resolved] (GEODE-8547) Command "show missing-disk-stores" not working, when all servers are down
[ https://issues.apache.org/jira/browse/GEODE-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mario Ivanac resolved GEODE-8547.

    Fix Version/s: 1.14.0
    Resolution: Fixed

> Command "show missing-disk-stores" not working, when all servers are down
>
> Key: GEODE-8547
> URL: https://issues.apache.org/jira/browse/GEODE-8547
> Project: Geode
> Issue Type: Bug
> Components: gfsh
> Reporter: Mario Ivanac
> Assignee: Mario Ivanac
> Priority: Major
> Labels: needs-review, pull-request-available
> Fix For: 1.14.0
>
> If a cluster with 2 locators and 2 servers was shut down ungracefully, it can happen that the locators that are able to start up do not have the most recent data needed to bring up the Cluster Configuration Service.
> If we execute the command "show missing-disk-stores", it will not work, since all servers are down, so we are stuck in this situation.
[jira] [Commented] (GEODE-8547) Command "show missing-disk-stores" not working, when all servers are down
[ https://issues.apache.org/jira/browse/GEODE-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226626#comment-17226626 ]

ASF subversion and git services commented on GEODE-8547:

Commit 7cc14eef52e06fe1e8c56bd766df56297b9c9ff8 in geode's branch refs/heads/develop from Mario Ivanac
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=7cc14ee ]

GEODE-8547: Added impacts to show missing disk-stores (#5567)

* GEODE-8547: Added impacts to show missing disk-stores
* GEODE-8547: Added DUnit test
* GEODE-8547: update after comments
* GEODE-8547: remove unused variables
* GEODE-8547: update test

> Command "show missing-disk-stores" not working, when all servers are down
>
> Key: GEODE-8547
> URL: https://issues.apache.org/jira/browse/GEODE-8547
> Project: Geode
> Issue Type: Bug
> Components: gfsh
> Reporter: Mario Ivanac
> Assignee: Mario Ivanac
> Priority: Major
> Labels: needs-review, pull-request-available
> Fix For: 1.14.0
>
> If a cluster with 2 locators and 2 servers was shut down ungracefully, it can happen that the locators that are able to start up do not have the most recent data needed to bring up the Cluster Configuration Service.
> If we execute the command "show missing-disk-stores", it will not work, since all servers are down, so we are stuck in this situation.
[jira] [Assigned] (GEODE-8688) Flaky C++ Native client integration test cases: PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop
[ https://issues.apache.org/jira/browse/GEODE-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alberto Gomez reassigned GEODE-8688:

    Assignee: Alberto Gomez

> Flaky C++ Native client integration test cases: PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop
>
> Key: GEODE-8688
> URL: https://issues.apache.org/jira/browse/GEODE-8688
> Project: Geode
> Issue Type: Bug
> Components: native client
> Affects Versions: 1.13.0
> Reporter: Alberto Gomez
> Assignee: Alberto Gomez
> Priority: Major
>
> The following test cases for the C++ native client are flaky:
> PartitionRegionOpsTest.getPartitionedRegionWithRedundancyServerGoesDownSingleHop
> PartitionRegionOpsTest.putPartitionedRegionWithRedundancyServerGoesDownSingleHop
>
> They fail very often when run in CI, although I have not seen them fail when executed manually.
[jira] [Created] (GEODE-8688) Flaky C++ Native client integration test cases: PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop
Alberto Gomez created GEODE-8688:

Summary: Flaky C++ Native client integration test cases: PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop
Key: GEODE-8688
URL: https://issues.apache.org/jira/browse/GEODE-8688
Project: Geode
Issue Type: Bug
Components: native client
Affects Versions: 1.13.0
Reporter: Alberto Gomez

The following test cases for the C++ native client are flaky:
PartitionRegionOpsTest.getPartitionedRegionWithRedundancyServerGoesDownSingleHop
PartitionRegionOpsTest.putPartitionedRegionWithRedundancyServerGoesDownSingleHop

They fail very often when run in CI, although I have not seen them fail when executed manually.
[jira] [Commented] (GEODE-8688) Flaky C++ Native client integration test cases: PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop
[ https://issues.apache.org/jira/browse/GEODE-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226730#comment-17226730 ]

ASF GitHub Bot commented on GEODE-8688:

albertogpz opened a new pull request #686:
URL: https://github.com/apache/geode-native/pull/686

The following integration test cases under integration/test (new integration tests) are flaky (they do not normally fail when run locally but fail very often when run in CI):
- PartitionRegionOpsTest.getPartitionedRegionWithRedundancyServerGoesDownSingleHop
- PartitionRegionOpsTest.putPartitionedRegionWithRedundancyServerGoesDownSingleHop

There were two reasons that could make them fail.

One is that sometimes the connections to the server have expired before the server is restarted and therefore, when traffic is sent to the restarted server, no errors are found. To fix this, the pool configuration for the test client has been changed so that connections do not expire.

The other is that sometimes the connection error is found by the ping thread, which invokes the ThinClientPoolDM::sendRequestToEP() method. In this method, when an IO error or TIMEOUT error is encountered, the endpoint is not removed from the metadata (by means of the removeBucketServerLocation method). The code has been updated to remove the metadata in this case as well.

With these two changes, the test cases are no longer flaky.

> Flaky C++ Native client integration test cases: PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop
>
> Key: GEODE-8688
> URL: https://issues.apache.org/jira/browse/GEODE-8688
> Project: Geode
> Issue Type: Bug
> Components: native client
> Affects Versions: 1.13.0
> Reporter: Alberto Gomez
> Assignee: Alberto Gomez
> Priority: Major
>
> The following test cases for the C++ native client are flaky:
> PartitionRegionOpsTest.getPartitionedRegionWithRedundancyServerGoesDownSingleHop
> PartitionRegionOpsTest.putPartitionedRegionWithRedundancyServerGoesDownSingleHop
>
> They fail very often when run in CI, although I have not seen them fail when executed manually.
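The second fix in pull request #686 (dropping an endpoint from the single-hop metadata when a ping hits an IO or timeout error) can be sketched roughly as follows. This is not the geode-native implementation: `BucketMetadata`, `SendResult` and `onSendResult` are hypothetical stand-ins, and only the method name removeBucketServerLocation is taken from the PR text.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical outcome of sending a request (or a ping) to an endpoint.
enum class SendResult { Ok, IoError, Timeout };

// Hypothetical stand-in for the client-side bucket-to-server metadata.
class BucketMetadata {
 public:
  void addLocation(int bucketId, const std::string& endpoint) {
    assert(bucketId >= 0);
    locations_[bucketId].push_back(endpoint);
  }

  // Forget the endpoint for every bucket it was believed to host.
  void removeBucketServerLocation(const std::string& endpoint) {
    for (auto& entry : locations_) {
      auto& servers = entry.second;
      servers.erase(std::remove(servers.begin(), servers.end(), endpoint),
                    servers.end());
    }
  }

  // First known server for the bucket, or an empty string if none is left.
  std::string firstServerFor(int bucketId) const {
    auto it = locations_.find(bucketId);
    if (it == locations_.end() || it->second.empty()) return "";
    return it->second.front();
  }

 private:
  std::map<int, std::vector<std::string>> locations_;
};

// Applied to every send result, including those seen by a ping thread,
// so stale single-hop routes are dropped regardless of who saw the error.
void onSendResult(BucketMetadata& metadata, const std::string& endpoint,
                  SendResult result) {
  if (result == SendResult::IoError || result == SendResult::Timeout) {
    metadata.removeBucketServerLocation(endpoint);
  }
}
```

The point of the sketch is that the cleanup hangs off the send result itself rather than off the caller, so an error first observed by a background ping still invalidates the route used by later get and put operations.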
[jira] [Updated] (GEODE-8688) Flaky C++ Native client integration test cases: PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop
[ https://issues.apache.org/jira/browse/GEODE-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated GEODE-8688:

    Labels: pull-request-available (was: )

> Flaky C++ Native client integration test cases: PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop
>
> Key: GEODE-8688
> URL: https://issues.apache.org/jira/browse/GEODE-8688
> Project: Geode
> Issue Type: Bug
> Components: native client
> Affects Versions: 1.13.0
> Reporter: Alberto Gomez
> Assignee: Alberto Gomez
> Priority: Major
> Labels: pull-request-available
>
> The following test cases for the C++ native client are flaky:
> PartitionRegionOpsTest.getPartitionedRegionWithRedundancyServerGoesDownSingleHop
> PartitionRegionOpsTest.putPartitionedRegionWithRedundancyServerGoesDownSingleHop
>
> They fail very often when run in CI, although I have not seen them fail when executed manually.
[jira] [Commented] (GEODE-8689) CI Failure: DistributedAckPersistentRegionCCEDUnitTest > testConcurrentEventsOnEmptyRegion FAILED
[ https://issues.apache.org/jira/browse/GEODE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226737#comment-17226737 ]

Geode Integration commented on GEODE-8689:

Seen in [DistributedTestOpenJDK11 #574|https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK11/builds/574] ... see [test results|http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0467/test-results/distributedTest/1604579602/] or download [artifacts|http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0467/test-artifacts/1604579602/distributedtestfiles-OpenJDK11-1.14.0-build.0467.tgz].

> CI Failure: DistributedAckPersistentRegionCCEDUnitTest > testConcurrentEventsOnEmptyRegion FAILED
>
> Key: GEODE-8689
> URL: https://issues.apache.org/jira/browse/GEODE-8689
> Project: Geode
> Issue Type: Bug
> Reporter: Sarah Abbey
> Priority: Major
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK11/builds/574
> {code:java}
> org.awaitility.core.ConditionTimeoutException: Condition with alias 'Wait for the members to eventually be consistent' didn't complete within 5 minutes because assertion condition defined as a lambda expression in org.apache.geode.cache30.MultiVMRegionTestCase [region contents are not consistent for cckey7] expected:<"ccvalue513398912"> but was:.
>     at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:165)
>     at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
>     at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
>     at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:895)
>     at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:679)
>     ...
> {code}
[jira] [Created] (GEODE-8689) CI Failure: DistributedAckPersistentRegionCCEDUnitTest > testConcurrentEventsOnEmptyRegion FAILED
Sarah Abbey created GEODE-8689:

Summary: CI Failure: DistributedAckPersistentRegionCCEDUnitTest > testConcurrentEventsOnEmptyRegion FAILED
Key: GEODE-8689
URL: https://issues.apache.org/jira/browse/GEODE-8689
Project: Geode
Issue Type: Bug
Reporter: Sarah Abbey

https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK11/builds/574
{code:java}
org.awaitility.core.ConditionTimeoutException: Condition with alias 'Wait for the members to eventually be consistent' didn't complete within 5 minutes because assertion condition defined as a lambda expression in org.apache.geode.cache30.MultiVMRegionTestCase [region contents are not consistent for cckey7] expected:<"ccvalue513398912"> but was:.
    at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:165)
    at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
    at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
    at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:895)
    at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:679)
    ...
{code}
[jira] [Commented] (GEODE-8537) Memory increases whenever LRU eviction is enabled
[ https://issues.apache.org/jira/browse/GEODE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226923#comment-17226923 ]

ASF GitHub Bot commented on GEODE-8537:

gaussianrecurrence opened a new pull request #687:
URL: https://github.com/apache/geode-native/pull/687

- Whenever LRU eviction was enabled, a slight increase in memory usage was noted, specifically in a scenario in which a set of entries is continuously created and destroyed.
- The problem was that entries were inserted into the LRUList but not removed until LRU eviction happened, which in the case described above was never.
- The solution was to replace the LRUList with a refactored version called LRUQueue, and also to remove entries from the queue upon destroy or invalidation.
- A deadlock between EvictionController and EvictionThread has also been solved. However, this part is asking for a refactor.
- Unit tests have been added for the LRUQueue.
- An integration test for LRU eviction has been added to the new integration tests.
- A wrongly implemented LRU eviction test was also removed from the old integration tests.

> Memory increases whenever LRU eviction is enabled
>
> Key: GEODE-8537
> URL: https://issues.apache.org/jira/browse/GEODE-8537
> Project: Geode
> Issue Type: Bug
> Components: native client
> Affects Versions: 1.13.0
> Reporter: Mario Salazar de Torres
> Assignee: Mario Salazar de Torres
> Priority: Major
> Attachments: massif-8419.png, massif.out.8419
>
> *HAVING* configured concurrency-checks-enabled=false in the client-cache.xml for a region
> *HAVING* configured heap-lru-limit=10 in the client-cache.xml for the region
> *HAVING* configured heap-lru-delta=10 in the client-cache.xml for the region
> *HAVING* configured subscription-notification for the pool on which the region is defined
> *HAVING* registered interest on all the keys of this region, values included
> *AFTER* receiving lots of LOCAL_CREATE and LOCAL_DESTROY notifications
> *THEN* memory increases continuously over time, even going over the LRU limit.
> Find the massif tool report attached as massif.out.8419, showing the memory increase.
> Also, this is a capture of massif-visualizer for the report:
> !massif-8419.png!
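The behaviour the pull request describes (entries leave the LRU structure on destroy or invalidation instead of waiting for an eviction pass) can be sketched with standard containers. This is a minimal illustration under assumptions, not the LRUQueue from geode-native: `LruQueueSketch` and its methods are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical LRU queue with O(1) removal by key.
class LruQueueSketch {
 public:
  // Record a create or an access: the key becomes the most recently used.
  void touch(const std::string& key) {
    remove(key);
    order_.push_back(key);
    index_[key] = std::prev(order_.end());
  }

  // Called on destroy or invalidate so the queue does not keep stale keys.
  bool remove(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return false;
    order_.erase(it->second);
    index_.erase(it);
    return true;
  }

  // Pop the least recently used key; returns false when the queue is empty.
  bool evictOldest(std::string& evicted) {
    if (order_.empty()) return false;
    evicted = order_.front();
    index_.erase(evicted);
    order_.pop_front();
    return true;
  }

  std::size_t size() const {
    assert(order_.size() == index_.size());
    return order_.size();
  }

 private:
  std::list<std::string> order_;  // front = least recently used
  std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```

In the create-and-destroy loop from the issue, calling `remove` on every destroy keeps `size()` at zero, whereas a structure that only shrinks on eviction would grow with every created key.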
[jira] [Updated] (GEODE-8537) Memory increases whenever LRU eviction is enabled
[ https://issues.apache.org/jira/browse/GEODE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated GEODE-8537:

    Labels: pull-request-available (was: )

> Memory increases whenever LRU eviction is enabled
>
> Key: GEODE-8537
> URL: https://issues.apache.org/jira/browse/GEODE-8537
> Project: Geode
> Issue Type: Bug
> Components: native client
> Affects Versions: 1.13.0
> Reporter: Mario Salazar de Torres
> Assignee: Mario Salazar de Torres
> Priority: Major
> Labels: pull-request-available
> Attachments: massif-8419.png, massif.out.8419
>
> *HAVING* configured concurrency-checks-enabled=false in the client-cache.xml for a region
> *HAVING* configured heap-lru-limit=10 in the client-cache.xml for the region
> *HAVING* configured heap-lru-delta=10 in the client-cache.xml for the region
> *HAVING* configured subscription-notification for the pool on which the region is defined
> *HAVING* registered interest on all the keys of this region, values included
> *AFTER* receiving lots of LOCAL_CREATE and LOCAL_DESTROY notifications
> *THEN* memory increases continuously over time, even going over the LRU limit.
> Find the massif tool report attached as massif.out.8419, showing the memory increase.
> Also, this is a capture of massif-visualizer for the report:
> !massif-8419.png!
[jira] [Commented] (GEODE-8666) Enforce warning no-non-virtual-dtor
[ https://issues.apache.org/jira/browse/GEODE-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226926#comment-17226926 ]

ASF GitHub Bot commented on GEODE-8666:

gaussianrecurrence commented on pull request #680:
URL: https://github.com/apache/geode-native/pull/680#issuecomment-722570740

> I'll close this after @gaussianrecurrence reports back with any learnings from why the ABI compliance tool says it is fine (when I also agree it is not). Will also update the JIRA story to reflect that it can't be done yet

Sadly, nobody seems to know the reason. I guess that, at least from my side, it will have to remain a mystery. If you ever happen to find out why, please let me know :S

> Enforce warning no-non-virtual-dtor
>
> Key: GEODE-8666
> URL: https://issues.apache.org/jira/browse/GEODE-8666
> Project: Geode
> Issue Type: Improvement
> Components: native client
> Reporter: Michael Oleske
> Priority: Major
> Labels: pull-request-available
>
> Given I compile the code without exempting no-non-virtual-dtor
> Then it should compile
> Note - was marked as a todo
[jira] [Updated] (GEODE-8614) Provide a specific client-side exception for server LowMemoryException
[ https://issues.apache.org/jira/browse/GEODE-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated GEODE-8614:

    Labels: pull-request-available (was: )

> Provide a specific client-side exception for server LowMemoryException
>
> Key: GEODE-8614
> URL: https://issues.apache.org/jira/browse/GEODE-8614
> Project: Geode
> Issue Type: Improvement
> Components: native client
> Affects Versions: 1.11.0, 1.12.0, 1.13.0
> Reporter: Mario Salazar de Torres
> Priority: Major
> Labels: pull-request-available
>
> *AS A* native client contributor
> *I WANT* to have a client-side exception for LowMemoryException
> *SO THAT* I can notify accordingly from the client side upon server memory depletion.
> ----
> *Additional information*
> This is the call stack of the LowMemoryException:
> {noformat}
> [error 2020/10/13 09:54:14.401405 UTC 140522117220352] Region::put: An exception (org.apache.geode.cache.LowMemoryException: PartitionedRegion: /part_a cannot process operation on key foo|0 because members [192.168.240.14(dms-server-1:1):41000] are running low on memory
>     at org.apache.geode.internal.cache.partitioned.RegionAdvisor.checkIfBucketSick(RegionAdvisor.java:482)
>     at org.apache.geode.internal.cache.PartitionedRegion.checkIfAboveThreshold(PartitionedRegion.java:2278)
>     at org.apache.geode.internal.cache.PartitionedRegion.putInBucket(PartitionedRegion.java:2982)
>     at org.apache.geode.internal.cache.PartitionedRegion.virtualPut(PartitionedRegion.java:2212)
>     at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)
>     at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5573)
>     at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5533)
>     at org.apache.geode.internal.cache.LocalRegion.basicBridgePut(LocalRegion.java:5212)
>     at org.apache.geode.internal.cache.tier.sockets.command.Put65.cmdExecute(Put65.java:411)
>     at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:183)
>     at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:848)
>     at org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:72)
>     at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1212)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:676)
>     at org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:119)
>     at java.base/java.lang.Thread.run(Thread.java:834) ) happened at remote server.
> {noformat}
> The idea would be to modify *ThinClientRegion::handleServerException* in order to return a new error and, later on, map it to a newly created exception.
> *Suggestions*
> The new exception could be called:
> * CacheServerLowMemoryException
> * ...
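The two-step idea in GEODE-8614 (classify the remote exception into an error, then map that error to a dedicated client exception) might look roughly like the sketch below. All names are assumptions for illustration: `ServerError`, `classifyServerException`, `throwForError` and `GenericServerException` are hypothetical, the exception name is only the one suggested in the issue, and matching on the Java class name in the message is a simplification rather than what ThinClientRegion::handleServerException does.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical client-side error codes.
enum class ServerError { None, LowMemory, Unknown };

// Name suggested in the issue; the class itself is illustrative only.
class CacheServerLowMemoryException : public std::runtime_error {
 public:
  using std::runtime_error::runtime_error;
};

// Hypothetical fallback for server errors without a dedicated type.
class GenericServerException : public std::runtime_error {
 public:
  using std::runtime_error::runtime_error;
};

// Step 1: turn the remote exception text into an error code.
ServerError classifyServerException(const std::string& remoteMessage) {
  if (remoteMessage.empty()) return ServerError::None;
  if (remoteMessage.find("org.apache.geode.cache.LowMemoryException") !=
      std::string::npos) {
    return ServerError::LowMemory;
  }
  return ServerError::Unknown;
}

// Step 2: map the error code to a client-side exception type.
void throwForError(ServerError error, const std::string& message) {
  switch (error) {
    case ServerError::None:
      return;
    case ServerError::LowMemory:
      throw CacheServerLowMemoryException(message);
    case ServerError::Unknown:
      throw GenericServerException(message);
  }
}
```

Keeping classification and throwing separate means callers that only need the error code can use it directly, while application code can catch the low-memory case by type and react to it.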
[jira] [Commented] (GEODE-8614) Provide an specific client-side exception for server LowMemoryException
     [ https://issues.apache.org/jira/browse/GEODE-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226929#comment-17226929 ]

ASF GitHub Bot commented on GEODE-8614:
---------------------------------------

gaussianrecurrence opened a new pull request #688:
URL: https://github.com/apache/geode-native/pull/688

   - Added LowMemoryException to be thrown in the client whenever the server runs out of memory.
   - Added QueryExecutionLowMemoryException to be thrown in the client whenever the monitoring queries feature on the server detects that the member is running low on memory.
   - Added UTs to verify that error-to-exception translation is working.
   - Added new integration tests for both exceptions.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Provide an specific client-side exception for server LowMemoryException
> -----------------------------------------------------------------------
>
>                 Key: GEODE-8614
>                 URL: https://issues.apache.org/jira/browse/GEODE-8614
>             Project: Geode
>          Issue Type: Improvement
>          Components: native client
>    Affects Versions: 1.11.0, 1.12.0, 1.13.0
>            Reporter: Mario Salazar de Torres
>            Priority: Major
[jira] [Created] (GEODE-8690) Member that fails availability check is never suspected again
Bruce J Schuchardt created GEODE-8690:
-----------------------------------------

             Summary: Member that fails availability check is never suspected again
                 Key: GEODE-8690
                 URL: https://issues.apache.org/jira/browse/GEODE-8690
             Project: Geode
          Issue Type: Bug
          Components: membership
    Affects Versions: 1.13.0, 1.12.0, 1.14.0
            Reporter: Bruce J Schuchardt


In a test run on support/1.12 there was a cluster with 3 locators and a number of servers. It had a membership view like this:
{noformat}
[ loc1, loc2, loc3, server1, server2, etc]
{noformat}
The test killed loc1 and loc2 and tried to restart loc2. In this scenario loc3 should have detected the loss of the other two locators and it should have become the membership coordinator but it didn't. Loc3 detected the loss of loc2 and then received a LEAVE request from loc1. At that point it ought to have either started examining loc2 again or perhaps just become the coordinator, but it did neither of these and the cluster had no coordinator.

This is similar to GEODE-3780 but in that case an earlier availability check passed.

In the test run the names of the locators are loc1=locatorgemfire_4_3 loc2=locatorgemfire_4_4 and loc3=locatorgemfire_4_2
{noformat}
[info 2020/10/30 21:51:51.197 PDT :41005 shared unordered uid=2 port=42550> tid=0x36] Performing availability check for suspect member (locatorgemfire_4_4_host2_3884:3884:locator):41005 reason=member unexpectedly shut down shared, unordered connection

[info 2020/10/30 21:51:51.309 PDT tid=0x51] received leave request from (locatorgemfire_4_3_host2_3866:3866:locator):41004 for (locatorgemfire_4_3_host2_3866:3866:locator):41004

[info 2020/10/30 21:51:51.345 PDT tid=0x51] Checking to see if I should become coordinator. My address is (locatorgemfire_4_2_host2_3852:3852:locator):41007

[info 2020/10/30 21:51:51.346 PDT tid=0x51] View with removed and left members removed is View[rs-(locatorgemfire_4_3_host2_3866:3866:locator):41004|3] members: [(locatorgemfire_4_4_host2_3884:3884:locator):41005, (locatorgemfire_4_2_host2_3852:3852:locator):41007, (locatorgemfire_4_1_host2_3843:3843:locator):41006, (peergemfire_4_1_host2_3959:3959):41010{lead}, (peergemfire_4_2_host2_3967:3967):41009] and coordinator would be (locatorgemfire_4_4_host2_3884:3884:locator):41005
{noformat}
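
To make the last log line easier to follow, here is a deliberately simplified model of the "coordinator would be" decision. It assumes the candidate is simply the first member of the ordered view that has not been removed or left; the real membership code may take more into account, and the class and method names below are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class CoordinatorSketch {
  /** Returns the first member of the ordered view that is not in the excluded set. */
  static Optional<String> wouldBeCoordinator(List<String> view, Set<String> excluded) {
    return view.stream().filter(member -> !excluded.contains(member)).findFirst();
  }

  public static void main(String[] args) {
    List<String> view = Arrays.asList("loc1", "loc2", "loc3", "server1", "server2");

    // loc1 sent a leave request; loc2 is still in the view, as in the log above.
    Set<String> leftOnly = new HashSet<>(Arrays.asList("loc1"));
    System.out.println(wouldBeCoordinator(view, leftOnly).get()); // loc2

    // If loc2 were also treated as gone, loc3 would see itself as coordinator.
    Set<String> leftAndLost = new HashSet<>(Arrays.asList("loc1", "loc2"));
    System.out.println(wouldBeCoordinator(view, leftAndLost).get()); // loc3
  }
}
```

Under this model, loc3 keeps deferring to loc2 for as long as loc2 stays in the view, which matches the reported symptom of a cluster left without a coordinator.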
[jira] [Commented] (GEODE-8688) Flaxy C++ Native client integration test cases: PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop
     [ https://issues.apache.org/jira/browse/GEODE-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226938#comment-17226938 ]

ASF GitHub Bot commented on GEODE-8688:
---------------------------------------

pdxcodemonkey commented on a change in pull request #686:
URL: https://github.com/apache/geode-native/pull/686#discussion_r518308858

## File path: cppcache/integration/test/PartitionRegionOpsTest.cpp ##
@@ -144,6 +146,10 @@ void verifyMetadataWasRemovedAtFirstError() {
       }
     }
   }
+  std::cout << "timeoutErrors: " << timeoutErrors << ", ioErrors: " << ioErrors
+            << ", metadataRemovedDueToTimeout: " << metadataRemovedDueToTimeout
+            << ", metadataRemovedDueToIoErr: " << metadataRemovedDueToIoErr
+            << std::endl;

Review comment:
   If you really want to see this output when running your test, I believe it's best to use std::cerr rather than cout. This looks like leftover debugging trace stuff to me, tho.

> Flaxy C++ Native client integration test cases: PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-8688
>                 URL: https://issues.apache.org/jira/browse/GEODE-8688
>             Project: Geode
>          Issue Type: Bug
>          Components: native client
>    Affects Versions: 1.13.0
>            Reporter: Alberto Gomez
>            Assignee: Alberto Gomez
>            Priority: Major
>              Labels: pull-request-available
>
> The following test cases for the C++ native client are flaky:
> PartitionRegionOpsTest.getPartitionedRegionWithRedundancyServerGoesDownSingleHop
> PartitionRegionOpsTest.putPartitionedRegionWithRedundancyServerGoesDownSingleHop
>
> They fail very often when run in CI although I have not seen them fail when executed manually.
[jira] [Commented] (GEODE-8681) peer-to-peer message loss due to sending connection closing with TLS enabled
     [ https://issues.apache.org/jira/browse/GEODE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226948#comment-17226948 ]

ASF GitHub Bot commented on GEODE-8681:
---------------------------------------

echobravopapa opened a new pull request #5706:
URL: https://github.com/apache/geode/pull/5706

   …ng with TLS enabled (#5699)

   A socket-read could pick up more than one message and a single unwrap() could decrypt multiple messages. Normally the engine isn't closed and it reports normal status from an unwrap() operation, and Connection.processInputBuffer picks up each message, one by one, from the buffer and dispatches them. But if the SSLEngine is closed we were ignoring any already-decrypted data sitting in the unwrapped buffer and instead we were throwing an SSLException.

   (cherry picked from commit 7da8f9b516ac1e2525a1dfc922af7bfb8995f2c6)
   (cherry picked from commit 03bbc2ac54998cbb015d533e4fe6e75b3e973146)

   Thank you for submitting a contribution to Apache Geode.

   In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   - [ ] Is your initial contribution a single, squashed commit?
   - [ ] Does `gradlew build` run cleanly?
   - [ ] Have you written or updated unit tests to verify your changes?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?

   ### Note:
   Please ensure that once the PR is submitted, check Concourse for build issues and submit an update to your PR as soon as possible. If you need help, please send an email to d...@geode.apache.org.

> peer-to-peer message loss due to sending connection closing with TLS enabled
> ----------------------------------------------------------------------------
>
>                 Key: GEODE-8681
>                 URL: https://issues.apache.org/jira/browse/GEODE-8681
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, messaging
>    Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0
>            Reporter: Bruce J Schuchardt
>            Assignee: Bruce J Schuchardt
>            Priority: Major
>              Labels: pull-request-available, release-blocker
>
> We have observed message loss when TLS is enabled and a distributed lock is released right after sending a message that doesn't require acknowledgement if the sending socket is immediately closed. The closing of sockets immediately after sending a message is frequently seen in function execution threads or server-side application threads that use this pattern:
> {code:java}
> try {
>   DistributedSystem.setThreadsSocketPolicy(false);
>   acquireDistributedLock(lockName);
>   // perform one or more cache operations
> } finally {
>   distLockService.unlock(lockName);
>   DistributedSystem.releaseThreadsSockets(); // closes the socket
> }
> {code}
> The fault seems to be in NioSSLEngine.unwrap(), which throws an SSLException() if it finds the SSLEngine is closed even though there is valid data in its decrypt buffer. It shouldn't throw an exception in that case.
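
The behaviour described above (hand back already-decrypted data even when the engine reports that it is closed, then dispatch each message in the buffer one by one) can be sketched as below. This is not the Geode implementation: UnwrapSketch and both of its methods are hypothetical, and the framing of a 4-byte length followed by the payload is an assumption made only to keep the example self-contained. SSLEngineResult.Status and SSLException are the standard javax.net.ssl types.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLException;

class UnwrapSketch {
  /**
   * Decides what to do after an unwrap. Decrypted bytes are returned even if the engine
   * reports CLOSED; an exception is raised only when it is closed and nothing was decrypted.
   * The buffer is expected to be in read mode (already flipped).
   */
  static ByteBuffer afterUnwrap(SSLEngineResult.Status status, ByteBuffer decrypted)
      throws SSLException {
    if (status == SSLEngineResult.Status.CLOSED && decrypted.remaining() == 0) {
      throw new SSLException("engine closed and no decrypted data available");
    }
    return decrypted;
  }

  /** Splits a buffer of [4-byte length][payload] frames (assumed framing) into messages. */
  static List<String> drainMessages(ByteBuffer decrypted) {
    List<String> messages = new ArrayList<>();
    while (decrypted.remaining() >= Integer.BYTES) {
      decrypted.mark();
      int length = decrypted.getInt();
      if (decrypted.remaining() < length) {
        decrypted.reset(); // incomplete frame: leave it in place for the next read
        break;
      }
      byte[] payload = new byte[length];
      decrypted.get(payload);
      messages.add(new String(payload, StandardCharsets.UTF_8));
    }
    return messages;
  }
}
```

The point of the sketch is the ordering: the check for remaining decrypted bytes comes before the decision to throw, so a message that arrived just ahead of the close is still delivered.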
[jira] [Updated] (GEODE-8686) Tombstone removal optimization during GII could cause deadlock
     [ https://issues.apache.org/jira/browse/GEODE-8686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated GEODE-8686:
----------------------------------
    Labels: pull-request-available  (was: )

> Tombstone removal optimization during GII could cause deadlock
> --------------------------------------------------------------
>
>                 Key: GEODE-8686
>                 URL: https://issues.apache.org/jira/browse/GEODE-8686
>             Project: Geode
>          Issue Type: Improvement
>    Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0, 1.14.0
>            Reporter: Donal Evans
>            Assignee: Donal Evans
>            Priority: Major
>              Labels: pull-request-available
>
> Similar to the issue described in GEODE-6526, if the condition in the below if statement in {{AbstractRegionMap.initialImagePut()}} evaluates to true, a call to {{AbstractRegionMap.removeTombstone()}} will be triggered that could lead to deadlock between the calling thread and a Tombstone GC thread calling {{TombstoneService.gcTombstones()}}.
> {code:java}
> if (owner.getServerProxy() == null && owner.getVersionVector().isTombstoneTooOld(
>     entryVersion.getMemberID(), entryVersion.getRegionVersion())) {
>   // the received tombstone has already been reaped, so don't retain it
>   if (owner.getIndexManager() != null) {
>     owner.getIndexManager().updateIndexes(oldRe, IndexManager.REMOVE_ENTRY,
>         IndexProtocol.REMOVE_DUE_TO_GII_TOMBSTONE_CLEANUP);
>   }
>   removeTombstone(oldRe, entryVersion, false, false);
>   return false;
> } else {
>   owner.scheduleTombstone(oldRe, entryVersion);
>   lruEntryDestroy(oldRe);
> }
> {code}
> The proposed change is to remove this if statement and allow the old tombstone to be collected later by calling {{scheduleTombstone()}} in all cases.
>
> The call to {{AbstractRegionMap.removeTombstone()}} in {{AbstractRegionMap.initialImagePut()}} is intended to be an optimization to allow immediate removal of tombstones that we know have already been collected on other members, but since the conditions to trigger it are rare (the old entry must be a tombstone, the new entry received during GII must be a tombstone with a newer version, and we must have already collected a tombstone with a newer version than the new entry) and the overhead of scheduling a tombstone to be collected is comparatively low, the performance impact of removing this optimization in favour of simply scheduling the tombstone to be collected in all cases should be insignificant.
>
> The solution to the deadlock observed in GEODE-6526 was also to remove the call to {{AbstractRegionMap.removeTombstone()}} and allow the tombstone to be collected later and did not result in any unwanted behaviour, so the proposed fix should be similarly low-impact.
>
> Also of note is that with this proposed change, there will be no calls to {{AbstractRegionMap.removeTombstone()}} outside of the {{TombstoneService}} class, which should ensure that other deadlocks involving this method are not possible.
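
The shape of the proposed change, where the GII path only schedules tombstones and removal happens in exactly one place, can be modelled with a toy class like the one below. It is a simplified sketch, not Geode code: TombstoneServiceSketch and its string keys are hypothetical, and it uses a single monitor, so it does not reproduce the actual locks involved in the reported deadlock.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical stand-in for a tombstone service: the only place tombstones are removed.
class TombstoneServiceSketch {
  private final Map<String, Long> regionMap = new HashMap<>(); // key -> tombstone version
  private final Queue<String> scheduled = new ArrayDeque<>();

  /** Called on the image-transfer path: records the tombstone and defers its removal. */
  synchronized void scheduleTombstone(String key, long version) {
    regionMap.put(key, version);
    scheduled.add(key);
  }

  /** Called only by the collector: removes every scheduled tombstone, returns the count. */
  synchronized int gcTombstones() {
    int removed = 0;
    String key;
    while ((key = scheduled.poll()) != null) {
      if (regionMap.remove(key) != null) {
        removed++;
      }
    }
    return removed;
  }

  synchronized boolean hasTombstone(String key) {
    return regionMap.containsKey(key);
  }
}
```

Because no caller outside the service removes entries directly, there is only one code path that needs both the map and the collection queue, which is the property the last paragraph of the description relies on.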
[jira] [Commented] (GEODE-8686) Tombstone removal optimization during GII could cause deadlock
     [ https://issues.apache.org/jira/browse/GEODE-8686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226963#comment-17226963 ]

ASF GitHub Bot commented on GEODE-8686:
---------------------------------------

DonalEvans opened a new pull request #5707:
URL: https://github.com/apache/geode/pull/5707

   - Do not call AbstractRegionMap.removeTombstone() outside of TombstoneService class
   - Add test to confirm that tombstones are correctly scheduled and collected with this change

   Authored-by: Donal Evans

   Thank you for submitting a contribution to Apache Geode.

   In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

   ### For all changes:
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   - [x] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   - [x] Is your initial contribution a single, squashed commit?
   - [x] Does `gradlew build` run cleanly?
   - [x] Have you written or updated unit tests to verify your changes?
   - [N/A] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?

   ### Note:
   Please ensure that once the PR is submitted, check Concourse for build issues and submit an update to your PR as soon as possible. If you need help, please send an email to d...@geode.apache.org.

> Tombstone removal optimization during GII could cause deadlock
> --------------------------------------------------------------
>
>                 Key: GEODE-8686
>                 URL: https://issues.apache.org/jira/browse/GEODE-8686
>             Project: Geode
>          Issue Type: Improvement
>    Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0, 1.14.0
>            Reporter: Donal Evans
>            Assignee: Donal Evans
>            Priority: Major
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
     [ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226964#comment-17226964 ]

ASF GitHub Bot commented on GEODE-8652:
---------------------------------------

Bill opened a new pull request #5708:
URL: https://github.com/apache/geode/pull/5708

   Reverts apache/geode#5694

> member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-8652
>                 URL: https://issues.apache.org/jira/browse/GEODE-8652
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, messaging
>    Affects Versions: 1.12.0, 1.13.0, 1.14.0
>            Reporter: Bill Burcham
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.1, 1.14.0, 1.13.1
>
> An application encountered the following hang in a TLS-enabled cluster.
> Let's call the cluster members ds3 -> ds1.
> ds3 sends a {{PutAllPRMessage}} to ds1 and is stuck in {{SocketChannel.read()}} waiting for the acknowledgement.
> {{ClusterDistributionManager.uncleanShutdown()}} is invoked on ds3 to shut the member down. That thread blocks trying to acquire a lock on the {{NioSslEngine}} held by the first thread (the one waiting for the ack to the put-all).
> Somehow the shutdown thread must be allowed to proceed.
> Here's the hung thread in ds3 (a.k.a. vm_3_thr_34_dataStore3_host1_8592) trying to shut down the member but it's stuck waiting for the monitor on the {{NioSslEngine}}:
> {noformat}
> "vm_3_thr_34_dataStore3_host1_8592" #795 daemon prio=5 os_prio=0 tid=0x7fdb9c011000 nid=0x2fcc waiting for monitor entry [0x7fdb6f4b7000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at org.apache.geode.internal.tcp.Connection.notifyHandshakeWaiter(Connection.java:804)
>     - waiting to lock <0xf2635b28> (a org.apache.geode.internal.net.NioSslEngine)
>     at org.apache.geode.internal.tcp.Connection.close(Connection.java:1350)
>     at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1278)
>     at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
>     at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
>     at org.apache.geode.internal.tcp.ConnectionTable.close(ConnectionTable.java:661)
>     - locked <0xf2678cf8> (a java.util.ArrayList)
>     - locked <0xf1187348> (a java.util.concurrent.ConcurrentHashMap)
>     at org.apache.geode.internal.tcp.TCPConduit.stop(TCPConduit.java:487)
>     at org.apache.geode.distributed.internal.direct.DirectChannel.disconnect(DirectChannel.java:644)
>     - locked <0xf11867a8> (a org.apache.geode.distributed.internal.direct.DirectChannel)
>     at org.apache.geode.distributed.internal.DistributionImpl.disconnectDirectChannel(DistributionImpl.java:631)
>     at org.apache.geode.distributed.internal.DistributionImpl.access$200(DistributionImpl.java:82)
>     at org.apache.geode.distributed.internal.DistributionImpl$LifecycleListenerImpl.disconnect(DistributionImpl.java:904)
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.stop(GMSMembership.java:1908)
>     at org.apache.geode.distributed.internal.membership.gms.Services.stop(Services.java:302)
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership.shutdown(GMSMembership.java:1262)
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership.disconnect(GMSMembership.java:1847)
>     at org.apache.geode.distributed.internal.DistributionImpl.disconnect(DistributionImpl.java:501)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.uncleanShutdown(ClusterDistributionManager.java:1291)
> {noformat}
> That thread is waiting on a lock held by this thread (in ds3) which is waiting on an acknowledgement to a PutAllPRMessage sent to ds1.
> {noformat}
> "vm_3_thr_37_dataStore3_host1_8592" #857 daemon prio=5 os_prio=0 tid=0x7fdb9c030800 nid=0x30d1 runnable [0x7fdb732f]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226966#comment-17226966 ] ASF subversion and git services commented on GEODE-8652: Commit 9ef2718f243f34306880efc749d46d2d25172b4b in geode's branch refs/heads/revert-5694-backport-1-12-GEODE-8652-and-friends from Bill Burcham [ https://gitbox.apache.org/repos/asf?p=geode.git;h=9ef2718 ] Revert "GEODE-8652: NioSslEngine.close() Bypasses Locks (#5666)" This reverts commit 06642ead279c500180f396c865b6277cb92ae27d. > member hung in Connection.notifyHandshakeWaiter() during disconnect waiting > for a lock held by another thread in Connection.readAck() > -- > > Key: GEODE-8652 > URL: https://issues.apache.org/jira/browse/GEODE-8652 > Project: Geode > Issue Type: Bug > Components: membership, messaging >Affects Versions: 1.12.0, 1.13.0, 1.14.0 >Reporter: Bill Burcham >Assignee: Bill Burcham >Priority: Major > Labels: pull-request-available > Fix For: 1.12.1, 1.14.0, 1.13.1 > > > An application encountered the following hang in a TLS-enabled cluster. > Let's call the cluster members ds3 -> ds1. > ds3 sends a {{PutAllPRMessage}} to ds1 and is stuck in > {{SocketChannel.read()}} waiting for the acknowledgement. > {{ClusterDistributionManager.uncleanShutdown()}} is invoked on ds3 to shut > the member down. That thread blocks trying to acquire a lock on the > {{NioSslEngine}} held by the first thread (the one doing waiting for the ack > to the put-all.) > Somehow the shutdown thread must be allowed to proceed. > Here's the hung thread in ds3 (a.k.a. 
vm_3_thr_34_dataStore3_host1_8592) > trying to shut down the member but it's stuck waiting for the monitor on the > {{NioSslEngine}}: > {noformat} > "vm_3_thr_34_dataStore3_host1_8592" #795 daemon prio=5 os_prio=0 > tid=0x7fdb9c011000 nid=0x2fcc waiting for monitor entry > [0x7fdb6f4b7000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.geode.internal.tcp.Connection.notifyHandshakeWaiter(Connection.java:804) > - waiting to lock <0xf2635b28> (a > org.apache.geode.internal.net.NioSslEngine) > at org.apache.geode.internal.tcp.Connection.close(Connection.java:1350) > at > org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1278) > at > org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612) > at > org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604) > at > org.apache.geode.internal.tcp.ConnectionTable.close(ConnectionTable.java:661) > - locked <0xf2678cf8> (a java.util.ArrayList) > - locked <0xf1187348> (a java.util.concurrent.ConcurrentHashMap) > at org.apache.geode.internal.tcp.TCPConduit.stop(TCPConduit.java:487) > at > org.apache.geode.distributed.internal.direct.DirectChannel.disconnect(DirectChannel.java:644) > - locked <0xf11867a8> (a > org.apache.geode.distributed.internal.direct.DirectChannel) > at > org.apache.geode.distributed.internal.DistributionImpl.disconnectDirectChannel(DistributionImpl.java:631) > at > org.apache.geode.distributed.internal.DistributionImpl.access$200(DistributionImpl.java:82) > at > org.apache.geode.distributed.internal.DistributionImpl$LifecycleListenerImpl.disconnect(DistributionImpl.java:904) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.stop(GMSMembership.java:1908) > at > org.apache.geode.distributed.internal.membership.gms.Services.stop(Services.java:302) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.shutdown(GMSMembership.java:1262) > at > 
org.apache.geode.distributed.internal.membership.gms.GMSMembership.disconnect(GMSMembership.java:1847) > at > org.apache.geode.distributed.internal.DistributionImpl.disconnect(DistributionImpl.java:501) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.uncleanShutdown(ClusterDistributionManager.java:1291) > {noformat} > That thread is waiting on a lock held by this thread (in ds3) which is > waiting on an acknowledgement to a PutAllPRMessage sent to ds1. > {noformat} > "vm_3_thr_37_dataStore3_host1_8592" #857 daemon prio=5 os_prio=0 > tid=0x7fdb9c030800 nid=0x30d1 runnable [0x7fdb732f] >java.lang.Thread.State: RUNNABLE > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
     [ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226965#comment-17226965 ]

ASF GitHub Bot commented on GEODE-8652:
---------------------------------------

Bill opened a new pull request #5709:
URL: https://github.com/apache/geode/pull/5709

   Reverts apache/geode#5693

> member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-8652
>                 URL: https://issues.apache.org/jira/browse/GEODE-8652
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, messaging
>    Affects Versions: 1.12.0, 1.13.0, 1.14.0
>            Reporter: Bill Burcham
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.1, 1.14.0, 1.13.1
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226967#comment-17226967 ]

ASF subversion and git services commented on GEODE-8652:

Commit 9ef2718f243f34306880efc749d46d2d25172b4b in geode's branch refs/heads/revert-5694-backport-1-12-GEODE-8652-and-friends from Bill Burcham
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=9ef2718 ]

Revert "GEODE-8652: NioSslEngine.close() Bypasses Locks (#5666)"

This reverts commit 06642ead279c500180f396c865b6277cb92ae27d.
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226969#comment-17226969 ]

ASF subversion and git services commented on GEODE-8652:

Commit 4954648d5801148db42973315ab439fad86d4c1a in geode's branch refs/heads/revert-5693-backport-1-13-GEODE-8652-and-friends from Bill Burcham
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=4954648 ]

Revert "GEODE-8652: NioSslEngine.close() Bypasses Locks (#5666)"

This reverts commit b2af727ce23fd155f3665e3db2ecee6e8f80fba7.
[jira] [Commented] (GEODE-8540) Repackage DUnitBlackboard in a JUnit Rule
[ https://issues.apache.org/jira/browse/GEODE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226970#comment-17226970 ]

ASF subversion and git services commented on GEODE-8540:

Commit b06e328798d27c30683a850241681338ac7fed55 in geode's branch refs/heads/revert-5693-backport-1-13-GEODE-8652-and-friends from Bill Burcham
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=b06e328 ]

Revert "GEODE-8540: Create new DistributedBlackboard Rule (#5557)"

This reverts commit cde469c6b6955a334e6bbf22accfc0735f0c70f4.

> Repackage DUnitBlackboard in a JUnit Rule
> -
>
> Key: GEODE-8540
> URL: https://issues.apache.org/jira/browse/GEODE-8540
> Project: Geode
> Issue Type: Wish
> Components: tests
> Reporter: Kirk Lund
> Assignee: Kirk Lund
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.14.0, 1.13.1
>
>
> Repackage DUnitBlackboard in a JUnit Rule

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8136) Repackage and improve javadocs for UncheckedUtils
[ https://issues.apache.org/jira/browse/GEODE-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226971#comment-17226971 ]

ASF subversion and git services commented on GEODE-8136:

Commit ba3b156ec6907b773b66c8628f6507cd5f5d2d4f in geode's branch refs/heads/revert-5693-backport-1-13-GEODE-8652-and-friends from Bill Burcham
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=ba3b156 ]

Revert "GEODE-8136: Move UncheckedUtils to geode-common (#5123)"

This reverts commit 10af7ea015ec85ef02b2e972c7a3dd3ec23bcb7f.

> Repackage and improve javadocs for UncheckedUtils
> -
>
> Key: GEODE-8136
> URL: https://issues.apache.org/jira/browse/GEODE-8136
> Project: Geode
> Issue Type: Wish
> Components: core
> Reporter: Kirk Lund
> Assignee: Kirk Lund
> Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> UncheckedUtils is a collection of simple utilities for unchecked casts in both test and product code. We should move it to the most common module it can live in, rename its methods to be more descriptive, and add javadocs.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226981#comment-17226981 ]

ASF GitHub Bot commented on GEODE-8652:
---

Bill opened a new pull request #5710:
URL: https://github.com/apache/geode/pull/5710

This reverts commit 08e9e9673d0ed0a3d74c6d16e706817cab09.

Thank you for submitting a contribution to Apache Geode.

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

### For all changes:
- [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
- [ ] Is your initial contribution a single, squashed commit?
- [ ] Does `gradlew build` run cleanly?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?

### Note:
Please ensure that once the PR is submitted, check Concourse for build issues and submit an update to your PR as soon as possible. If you need help, please send an email to d...@geode.apache.org.
[jira] [Commented] (GEODE-8686) Tombstone removal optimization during GII could cause deadlock
[ https://issues.apache.org/jira/browse/GEODE-8686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226990#comment-17226990 ]

ASF GitHub Bot commented on GEODE-8686:
---

lgtm-com[bot] commented on pull request #5707:
URL: https://github.com/apache/geode/pull/5707#issuecomment-722646152

This pull request **fixes 1 alert** when merging f42efca780a24a697672b6d4f04fd66d82fa730a into 7cc14eef52e06fe1e8c56bd766df56297b9c9ff8 - [view on LGTM.com](https://lgtm.com/projects/g/apache/geode/rev/pr-28c30f2e70e63f14ced912c0945c9dd00166b91c)

**fixed alerts:**

* 1 for Dereferenced variable may be null

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Tombstone removal optimization during GII could cause deadlock
> --
>
> Key: GEODE-8686
> URL: https://issues.apache.org/jira/browse/GEODE-8686
> Project: Geode
> Issue Type: Improvement
> Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0, 1.14.0
> Reporter: Donal Evans
> Assignee: Donal Evans
> Priority: Major
> Labels: pull-request-available
>
> Similar to the issue described in GEODE-6526, if the condition in the below if statement in {{AbstractRegionMap.initialImagePut()}} evaluates to true, a call to {{AbstractRegionMap.removeTombstone()}} will be triggered that could lead to deadlock between the calling thread and a Tombstone GC thread calling {{TombstoneService.gcTombstones()}}.
> {code:java}
> if (owner.getServerProxy() == null
>     && owner.getVersionVector().isTombstoneTooOld(
>         entryVersion.getMemberID(), entryVersion.getRegionVersion())) {
>   // the received tombstone has already been reaped, so don't retain it
>   if (owner.getIndexManager() != null) {
>     owner.getIndexManager().updateIndexes(oldRe, IndexManager.REMOVE_ENTRY,
>         IndexProtocol.REMOVE_DUE_TO_GII_TOMBSTONE_CLEANUP);
>   }
>   removeTombstone(oldRe, entryVersion, false, false);
>   return false;
> } else {
>   owner.scheduleTombstone(oldRe, entryVersion);
>   lruEntryDestroy(oldRe);
> }
> {code}
> The proposed change is to remove this if statement and allow the old tombstone to be collected later by calling {{scheduleTombstone()}} in all cases. The call to {{AbstractRegionMap.removeTombstone()}} in {{AbstractRegionMap.initialImagePut()}} is intended to be an optimization to allow immediate removal of tombstones that we know have already been collected on other members, but since the conditions to trigger it are rare (the old entry must be a tombstone, the new entry received during GII must be a tombstone with a newer version, and we must have already collected a tombstone with a newer version than the new entry) and the overhead of scheduling a tombstone to be collected is comparatively low, the performance impact of removing this optimization in favour of simply scheduling the tombstone to be collected in all cases should be insignificant.
> The solution to the deadlock observed in GEODE-6526 was also to remove the call to {{AbstractRegionMap.removeTombstone()}} and allow the tombstone to be collected later and did not result in any unwanted behaviour, so the proposed fix should be similarly low-impact.
> Also of note is that with this proposed change, there will be no calls to {{AbstractRegionMap.removeTombstone()}} outside of the {{TombstoneService}} class, which should ensure that other deadlocks involving this method are not possible.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
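The change proposed in the GEODE-8686 description above (always schedule the old tombstone and leave removal to the collector) can be illustrated with a small standalone sketch. This is not Geode code: `TombstoneSchedulingSketch`, its `Entry` type, the queue, and `collect()` are hypothetical stand-ins, assumed here only to show the shape of the idea. The import path never removes anything itself, so it never needs whatever locks removal takes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical illustration only; none of these types are Geode classes. */
public class TombstoneSchedulingSketch {

  /** Stand-in for a region entry that holds a tombstone. */
  public static final class Entry {
    final String key;
    final long version;

    public Entry(String key, long version) {
      this.key = key;
      this.version = version;
    }
  }

  /** Tombstones waiting for the collector; safe to add to from any thread. */
  private final ConcurrentLinkedQueue<Entry> scheduled = new ConcurrentLinkedQueue<>();

  /** Keys removed so far; only the collector writes to this list. */
  private final List<String> removed = new ArrayList<>();

  /**
   * Import path (assumed analogue of initialImagePut): there is no
   * "tombstone too old" branch and no inline removal. The old tombstone
   * is scheduled in every case.
   */
  public void initialImagePut(Entry oldTombstone) {
    scheduled.add(oldTombstone);
  }

  /** Collector path: the only place where removal happens. */
  public int collect() {
    int count = 0;
    Entry e;
    while ((e = scheduled.poll()) != null) {
      removed.add(e.key);
      count++;
    }
    return count;
  }

  public List<String> removedKeys() {
    return removed;
  }

  public static void main(String[] args) throws InterruptedException {
    TombstoneSchedulingSketch sketch = new TombstoneSchedulingSketch();

    // A separate thread plays the role of the GII import.
    Thread importer = new Thread(() -> {
      for (int i = 0; i < 3; i++) {
        sketch.initialImagePut(new Entry("key" + i, i));
      }
    });
    importer.start();
    importer.join();

    // The collector runs afterwards and drains everything that was scheduled.
    System.out.println("collected " + sketch.collect());
  }
}
```

The trade-off mirrors the one described in the issue: a tombstone that could have been dropped immediately lives until the next collection pass, in exchange for removal having a single owner.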
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226996#comment-17226996 ]

ASF GitHub Bot commented on GEODE-8652:
---

Bill merged pull request #5708:
URL: https://github.com/apache/geode/pull/5708
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226997#comment-17226997 ]

ASF GitHub Bot commented on GEODE-8652:
---

Bill merged pull request #5709:
URL: https://github.com/apache/geode/pull/5709

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> member hung in Connection.notifyHandshakeWaiter() during disconnect waiting
> for a lock held by another thread in Connection.readAck()
> --
>
>                 Key: GEODE-8652
>                 URL: https://issues.apache.org/jira/browse/GEODE-8652
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, messaging
>    Affects Versions: 1.12.0, 1.13.0, 1.14.0
>            Reporter: Bill Burcham
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.1, 1.14.0, 1.13.1
>
>
> An application encountered the following hang in a TLS-enabled cluster.
> Let's call the cluster members ds3 -> ds1.
> ds3 sends a {{PutAllPRMessage}} to ds1 and is stuck in
> {{SocketChannel.read()}} waiting for the acknowledgement.
> {{ClusterDistributionManager.uncleanShutdown()}} is invoked on ds3 to shut
> the member down. That thread blocks trying to acquire a lock on the
> {{NioSslEngine}} held by the first thread (the one waiting for the ack
> to the put-all).
> Somehow the shutdown thread must be allowed to proceed.
> Here's the hung thread in ds3 (a.k.a. vm_3_thr_34_dataStore3_host1_8592)
> trying to shut down the member but it's stuck waiting for the monitor on the
> {{NioSslEngine}}:
> {noformat}
> "vm_3_thr_34_dataStore3_host1_8592" #795 daemon prio=5 os_prio=0 tid=0x7fdb9c011000 nid=0x2fcc waiting for monitor entry [0x7fdb6f4b7000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at org.apache.geode.internal.tcp.Connection.notifyHandshakeWaiter(Connection.java:804)
>       - waiting to lock <0xf2635b28> (a org.apache.geode.internal.net.NioSslEngine)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1350)
>       at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1278)
>       at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
>       at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
>       at org.apache.geode.internal.tcp.ConnectionTable.close(ConnectionTable.java:661)
>       - locked <0xf2678cf8> (a java.util.ArrayList)
>       - locked <0xf1187348> (a java.util.concurrent.ConcurrentHashMap)
>       at org.apache.geode.internal.tcp.TCPConduit.stop(TCPConduit.java:487)
>       at org.apache.geode.distributed.internal.direct.DirectChannel.disconnect(DirectChannel.java:644)
>       - locked <0xf11867a8> (a org.apache.geode.distributed.internal.direct.DirectChannel)
>       at org.apache.geode.distributed.internal.DistributionImpl.disconnectDirectChannel(DistributionImpl.java:631)
>       at org.apache.geode.distributed.internal.DistributionImpl.access$200(DistributionImpl.java:82)
>       at org.apache.geode.distributed.internal.DistributionImpl$LifecycleListenerImpl.disconnect(DistributionImpl.java:904)
>       at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.stop(GMSMembership.java:1908)
>       at org.apache.geode.distributed.internal.membership.gms.Services.stop(Services.java:302)
>       at org.apache.geode.distributed.internal.membership.gms.GMSMembership.shutdown(GMSMembership.java:1262)
>       at org.apache.geode.distributed.internal.membership.gms.GMSMembership.disconnect(GMSMembership.java:1847)
>       at org.apache.geode.distributed.internal.DistributionImpl.disconnect(DistributionImpl.java:501)
>       at org.apache.geode.distributed.internal.ClusterDistributionManager.uncleanShutdown(ClusterDistributionManager.java:1291)
> {noformat}
> That thread is waiting on a lock held by this thread (in ds3), which is
> waiting on an acknowledgement to a PutAllPRMessage sent to ds1.
> {noformat}
> "vm_3_thr_37_dataStore3_host1_8592" #857 daemon prio=5 os_prio=0 tid=0x7fdb9c030800 nid=0x30d1 runnable [0x7fdb732f]
>    java.lang.Thread.State: RUNNABLE
>       at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>       at sun.nio.ch.IOUtil.read(IOUtil.java:192)
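The two thread dumps above describe a reader that holds a monitor while blocked in a socket read, and a closer that needs the same monitor. The following is a minimal sketch of that shape, not Geode's code: `LockBypassingCloseSketch`, `engineLock`, `readAck()`, `closeUnderLock()` and `closeBypassingLock()` are hypothetical names, and a `java.nio.channels.Pipe` stands in for the TLS socket. It assumes one possible way out, in the spirit of the "close() Bypasses Locks" commit title: close the channel without taking the monitor, so that the blocked read fails and the reader releases the monitor on its own. Whether the actual fix works this way is not stated in these messages.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.util.concurrent.CountDownLatch;

/** Hypothetical illustration only; not Geode's Connection or NioSslEngine. */
final class LockBypassingCloseSketch {
  // Stands in for the NioSslEngine monitor seen in the thread dumps.
  private final Object engineLock = new Object();
  private final CountDownLatch readerHoldsLock = new CountDownLatch(1);
  private final Pipe pipe;

  LockBypassingCloseSketch() throws IOException {
    this.pipe = Pipe.open();
  }

  /** Like the thread awaiting the ack: blocks in read() while holding the monitor. */
  boolean readAck() {
    synchronized (engineLock) {
      readerHoldsLock.countDown();
      try {
        // Nothing is ever written, so this blocks until the channel is closed.
        return pipe.source().read(ByteBuffer.allocate(16)) > 0;
      } catch (IOException e) {
        return false; // the read was abandoned because the channel was closed
      }
    }
  }

  /** The problematic shape: this would wait forever for the reader's monitor. */
  void closeUnderLock() throws IOException {
    synchronized (engineLock) {
      pipe.source().close();
    }
  }

  /** Closes without the monitor; the blocked read then fails and frees it. */
  void closeBypassingLock() throws IOException {
    pipe.source().close();
    pipe.sink().close();
  }

  /** Returns true if the blocked reader terminated without having read an ack. */
  static boolean demo() {
    try {
      LockBypassingCloseSketch connection = new LockBypassingCloseSketch();
      boolean[] ackReceived = {true};
      Thread reader = new Thread(() -> ackReceived[0] = connection.readAck());
      reader.setDaemon(true);
      reader.start();
      connection.readerHoldsLock.await(); // reader now owns engineLock
      connection.closeBypassingLock();
      reader.join(10_000);
      return !reader.isAlive() && !ackReceived[0];
    } catch (IOException | InterruptedException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("blocked reader released the lock: " + demo());
  }
}
```

The sketch relies on the documented behaviour of interruptible NIO channels: closing a channel from another thread makes a blocked read on it throw. Calling `closeUnderLock()` in place of `closeBypassingLock()` reproduces the hang, because the closer then waits for a monitor that the reader cannot release.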
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226999#comment-17226999 ]

ASF subversion and git services commented on GEODE-8652:

Commit bec47047dec2ddb64e000b71004fbef8ed3b2b88 in geode's branch refs/heads/support/1.12 from Bill Burcham
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=bec4704 ]

Revert "GEODE-8652: NioSslEngine.close() Bypasses Locks (#5666)"

This reverts commit 06642ead279c500180f396c865b6277cb92ae27d.
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227000#comment-17227000 ]

ASF subversion and git services commented on GEODE-8652:

Commit ef74657254c2b2707a31b43af52af1734b71e961 in geode's branch refs/heads/support/1.13 from Bill Burcham
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=ef74657 ]

Revert "GEODE-8652: NioSslEngine.close() Bypasses Locks (#5666)"

This reverts commit b2af727ce23fd155f3665e3db2ecee6e8f80fba7.
[jira] [Commented] (GEODE-8136) Repackage and improve javadocs for UncheckedUtils
[ https://issues.apache.org/jira/browse/GEODE-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227002#comment-17227002 ]

ASF subversion and git services commented on GEODE-8136:

Commit 986334e9198a1756b839d0d13028f4a846ea29b5 in geode's branch refs/heads/support/1.13 from Bill Burcham
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=986334e ]

Revert "GEODE-8136: Move UncheckedUtils to geode-common (#5123)"

This reverts commit 10af7ea015ec85ef02b2e972c7a3dd3ec23bcb7f.

> Repackage and improve javadocs for UncheckedUtils
> -
>
>                 Key: GEODE-8136
>                 URL: https://issues.apache.org/jira/browse/GEODE-8136
>             Project: Geode
>          Issue Type: Wish
>          Components: core
>            Reporter: Kirk Lund
>            Assignee: Kirk Lund
>            Priority: Major
>             Fix For: 1.14.0, 1.13.1
>
>
> UncheckedUtils is a collection of simple utilities for unchecked casts in
> both test and product code. We should move it to the most common module it
> can live in, rename methods to be more descriptive, and add javadocs.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
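For readers unfamiliar with what "utilities for unchecked casts" means here, the following is a minimal sketch of such a helper. It is an illustration under assumptions, not the Geode class: `UncheckedCastSketch` and its `uncheckedCast` method are names chosen for this example. The idea is to confine the `@SuppressWarnings("unchecked")` annotation to one small method so that call sites stay warning-free.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of an unchecked-cast helper; not Geode's UncheckedUtils. */
final class UncheckedCastSketch {
  private UncheckedCastSketch() {
    // static utility, no instances
  }

  /**
   * Casts to whatever type the caller's context expects. The compiler cannot
   * verify generic type arguments here, so the warning is suppressed in this
   * one place instead of at every call site.
   */
  @SuppressWarnings("unchecked")
  static <T> T uncheckedCast(Object object) {
    return (T) object;
  }

  /** Recovers a typed list from an Object reference and reads one element. */
  static String demo() {
    Object raw = new ArrayList<>(Arrays.asList("a", "b"));
    List<String> typed = uncheckedCast(raw);
    return typed.get(1);
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

A helper like this does not make the cast safe. If the object's real element type differs from what the caller assumes, the failure surfaces later as a `ClassCastException` at the point of use, which is one reason descriptive method names and javadocs matter for such utilities.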
[jira] [Commented] (GEODE-8540) Repackage DUnitBlackboard in a JUnit Rule
[ https://issues.apache.org/jira/browse/GEODE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227001#comment-17227001 ]

ASF subversion and git services commented on GEODE-8540:

Commit 4886d2055f9cd0792694d0edb61537429a037439 in geode's branch refs/heads/support/1.13 from Bill Burcham
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=4886d20 ]

Revert "GEODE-8540: Create new DistributedBlackboard Rule (#5557)"

This reverts commit cde469c6b6955a334e6bbf22accfc0735f0c70f4.

> Repackage DUnitBlackboard in a JUnit Rule
> -
>
>                 Key: GEODE-8540
>                 URL: https://issues.apache.org/jira/browse/GEODE-8540
>             Project: Geode
>          Issue Type: Wish
>          Components: tests
>            Reporter: Kirk Lund
>            Assignee: Kirk Lund
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.14.0, 1.13.1
>
>
> Repackage DUnitBlackboard in a JUnit Rule

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227018#comment-17227018 ]

ASF GitHub Bot commented on GEODE-8652:
---

Bill merged pull request #5710:
URL: https://github.com/apache/geode/pull/5710
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227019#comment-17227019 ]

ASF subversion and git services commented on GEODE-8652:

Commit 9653a0b6e490272fa77d375049f0e9f1cb6c8929 in geode's branch refs/heads/develop from Bill Burcham
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=9653a0b ]

Revert "GEODE-8652: NioSslEngine.close() Bypasses Locks (#5666)"

This reverts commit 08e9e9673d0ed0a3d74c6d16e706817cab09.
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227030#comment-17227030 ]

ASF GitHub Bot commented on GEODE-8652:
---

Bill opened a new pull request #5712:
URL: https://github.com/apache/geode/pull/5712

This is a second try fixing GEODE-8652. We committed the change a week ago but then found other problems in some applications. We've included a new concurrency test in this latest PR that validates that issue is resolved.

- [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [x] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
- [ ] Is your initial contribution a single, squashed commit?
- [x] Does `gradlew build` run cleanly?
- [x] Have you written or updated unit tests to verify your changes?
[jira] [Updated] (GEODE-8540) Repackage DUnitBlackboard in a JUnit Rule
[ https://issues.apache.org/jira/browse/GEODE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen Nichols updated GEODE-8540:

    Fix Version/s:     (was: 1.13.1)

> Repackage DUnitBlackboard in a JUnit Rule
> -
>
>                 Key: GEODE-8540
>                 URL: https://issues.apache.org/jira/browse/GEODE-8540
>             Project: Geode
>          Issue Type: Wish
>          Components: tests
>            Reporter: Kirk Lund
>            Assignee: Kirk Lund
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.14.0
>
>
> Repackage DUnitBlackboard in a JUnit Rule

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (GEODE-8136) Repackage and improve javadocs for UncheckedUtils
[ https://issues.apache.org/jira/browse/GEODE-8136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen Nichols updated GEODE-8136:

    Fix Version/s:     (was: 1.13.1)

> Repackage and improve javadocs for UncheckedUtils
> -
>
>                 Key: GEODE-8136
>                 URL: https://issues.apache.org/jira/browse/GEODE-8136
>             Project: Geode
>          Issue Type: Wish
>          Components: core
>            Reporter: Kirk Lund
>            Assignee: Kirk Lund
>            Priority: Major
>             Fix For: 1.14.0
>
>
> UncheckedUtils is a collection of simple utilities for unchecked casts in
> both test and product code. We should move it to the most common module it
> can live in, rename methods to be more descriptive, and add javadocs.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227041#comment-17227041 ]

ASF subversion and git services commented on GEODE-7727:

Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ]

GEODE-7727: modify sender thread to detect relese of connection (#4751)

* GEODE-7727: modify sender thread to detect relese of connection
* GEODE-7727: Update solution only for shared connections
* GEODE-7727: added test
* GEODE-7727: update ater comments
* GEODE-7727: update test
* GEODE-7727: fix for async write hanging
* GEODE-7727: Test of region operations in the face of closed connections

Adding a test for what happens to region operations when a connection is
closed out from under the system. This test hangs without the changes to
let the reader thread keep running.

Fix to test

* GEODE-7727: Preventing a double release of the input buffer

The releaseInputBuffer method was not thread safe. If it is called
concurrently, the buffer ends up being released twice, which adds the
buffer to the buffer pool twice. Later, this could result in two threads
using the same buffer, corrupting its contents.

With the changes for GEODE-7727, we made it likely that
releaseInputBuffer would be called concurrently. If a member departs,
one thread will call Connection.close. Connection.close will close the
socket and call releaseInputBuffer. However, closing the socket will
wake up the reader thread, which will also call releaseInputBuffer
concurrently.

Making releaseInputBuffer thread safe by introducing a lock.

* GEODE-7727: update after merge
* GEODE-7727: update test name

Co-authored-by: Dan Smith
(cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73)

> Geode P2P connection hanging
> 
>
>                 Key: GEODE-7727
>                 URL: https://issues.apache.org/jira/browse/GEODE-7727
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Mario Ivanac
>            Assignee: Mario Ivanac
>            Priority: Major
>              Labels: needs-review, pull-request-available
>             Fix For: 1.13.0
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Geode P2P handshake reader stops listening to its socket once
> the handshake between 2 peers is established. This seems to be a design
> choice.
> The problem is when the connection gets killed (TCP FIN).
> Since nothing is listening on the socket, nothing will get that FIN packet
> and close the connection. The connection is left hanging (CLOSE-WAIT state).
> The peers are then unable to establish proper P2P communication later.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
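The double-release problem described in the commit message above can be sketched in a few lines. This is an illustration under assumptions rather than the Geode implementation: `InputBufferReleaseSketch`, `BufferPool` and `inputBufferLock` are hypothetical names, and the pool is a plain deque. The sketch shows one common way to make a release method safe to call from two threads: take the buffer reference and null the field under a lock, so that only the first caller hands the buffer back to the pool.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration of a release that is safe to call concurrently. */
final class InputBufferReleaseSketch {

  /** Hypothetical pool: a deque guarded by its own monitor. */
  static final class BufferPool {
    private final Deque<ByteBuffer> free = new ArrayDeque<>();

    synchronized void release(ByteBuffer buffer) {
      free.push(buffer);
    }

    synchronized int size() {
      return free.size();
    }
  }

  private final BufferPool pool;
  private final Object inputBufferLock = new Object();
  private ByteBuffer inputBuffer = ByteBuffer.allocate(1024);

  InputBufferReleaseSketch(BufferPool pool) {
    this.pool = pool;
  }

  /** Only the first caller sees a non-null buffer, so it is pooled once. */
  void releaseInputBuffer() {
    ByteBuffer toRelease;
    synchronized (inputBufferLock) {
      toRelease = inputBuffer;
      inputBuffer = null;
    }
    if (toRelease != null) {
      pool.release(toRelease);
    }
  }

  /** Two threads (a closer and a reader) both release; returns the pool size. */
  static int demo() {
    BufferPool pool = new BufferPool();
    InputBufferReleaseSketch connection = new InputBufferReleaseSketch(pool);
    Thread closer = new Thread(connection::releaseInputBuffer);
    Thread reader = new Thread(connection::releaseInputBuffer);
    closer.start();
    reader.start();
    try {
      closer.join();
      reader.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return pool.size();
  }

  public static void main(String[] args) {
    System.out.println("buffers in pool after two releases: " + demo());
  }
}
```

Without the lock, both threads could read the same non-null `inputBuffer` before either clears it, and the same buffer would then sit in the pool twice, which is the corruption risk the commit message describes.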
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227037#comment-17227037 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {color:#172b4d}Geode P2P handshake reader stops listening to it's socket once > the handshake between 2 peers is established. This seems to be a design > choice. > {color} > {color:#172b4d}The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN package > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227040#comment-17227040 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {color:#172b4d}Geode P2P handshake reader stops listening to it's socket once > the handshake between 2 peers is established. This seems to be a design > choice. > {color} > {color:#172b4d}The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN package > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227039#comment-17227039 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {color:#172b4d}Geode P2P handshake reader stops listening to it's socket once > the handshake between 2 peers is established. This seems to be a design > choice. > {color} > {color:#172b4d}The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN package > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227042#comment-17227042 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {color:#172b4d}Geode P2P handshake reader stops listening to it's socket once > the handshake between 2 peers is established. This seems to be a design > choice. > {color} > {color:#172b4d}The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN package > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227038#comment-17227038 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {color:#172b4d}Geode P2P handshake reader stops listening to it's socket once > the handshake between 2 peers is established. This seems to be a design > choice. > {color} > {color:#172b4d}The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN package > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227054#comment-17227054 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {color:#172b4d}Geode P2P handshake reader stops listening to it's socket once > the handshake between 2 peers is established. This seems to be a design > choice. > {color} > {color:#172b4d}The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN package > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227044#comment-17227044 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {color:#172b4d}Geode P2P handshake reader stops listening to it's socket once > the handshake between 2 peers is established. This seems to be a design > choice. > {color} > {color:#172b4d}The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN package > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227053#comment-17227053 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {color:#172b4d}Geode P2P handshake reader stops listening to it's socket once > the handshake between 2 peers is established. This seems to be a design > choice. > {color} > {color:#172b4d}The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN package > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227047#comment-17227047 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {color:#172b4d}Geode P2P handshake reader stops listening to it's socket once > the handshake between 2 peers is established. This seems to be a design > choice. > {color} > {color:#172b4d}The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN package > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227046#comment-17227046 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {color:#172b4d}Geode P2P handshake reader stops listening to it's socket once > the handshake between 2 peers is established. This seems to be a design > choice. > {color} > {color:#172b4d}The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN package > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227051#comment-17227051 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > {color:#172b4d}Geode P2P handshake reader stops listening to it's socket once > the handshake between 2 peers is established. This seems to be a design > choice. > {color} > {color:#172b4d}The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN package > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7727) Geode P2P connection hanging
[ https://issues.apache.org/jira/browse/GEODE-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227049#comment-17227049 ] ASF subversion and git services commented on GEODE-7727: Commit 798a245147835c1e1b0026e863b9816a3ce2c551 in geode's branch refs/heads/support/1.12 from Mario Ivanac [ https://gitbox.apache.org/repos/asf?p=geode.git;h=798a245 ] GEODE-7727: modify sender thread to detect relese of connection (#4751) * GEODE-7727: modify sender thread to detect relese of connection * GEODE-7727: Update solution only for shared connections * GEODE-7727: added test * GEODE-7727: update ater comments * GEODE-7727: update test * GEODE-7727: fix for async write hanging * GEODE-7727: Test of region operations in the face of closed connections Adding a test for what happens to region operations when a connection is closed out from under the system. This test hangs without the changes to let the reader thread keep running. Fix to test * GEODE-7727: Preventing a double release of the input buffer The releaseInputBuffer method was not thread safe. If it is called concurrently, it will end up being released twice, which will add the buffer to to the buffer pool twice. Later, this could result in two threads using the same buffer, resulting in corruption of the buffer. With the changes for GEODE-7727, we made it likely that releaseInputBuffer would be called concurrently. If a member departs, one thread will call Connection.close. Connection.close will close the socket and call releaseInputBuffer. However, closing the socket will wake up the reader thread, which will also call releaseInputBuffer concurrently. Making releaseInputBuffer thread safe by introducing a lock. 
* GEODE-7727: update after merge * GEODE-7727: update test name Co-authored-by: Dan Smith (cherry picked from commit c8413592e5573f675c538c63ef9ee9f97a349e73) > Geode P2P connection hanging > > > Key: GEODE-7727 > URL: https://issues.apache.org/jira/browse/GEODE-7727 > Project: Geode > Issue Type: Bug >Reporter: Mario Ivanac >Assignee: Mario Ivanac >Priority: Major > Labels: needs-review, pull-request-available > Fix For: 1.13.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > Geode P2P handshake reader stops listening to its socket once > the handshake between 2 peers is established. This seems to be a design > choice. > > The problem is when the connection gets killed (TCP FIN). > Since nothing is listening on the socket, nothing will get that FIN packet > and close the connection. The connection is left hanging (CLOSE-WAIT state). > The peers are then unable to establish proper P2P communication later. -- This message was sent by Atlassian Jira (v8.3.4#803005)
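The commit message above describes releaseInputBuffer being called by both Connection.close and the reader thread, with a lock added to stop the buffer from reaching the pool twice. A minimal sketch of that idea follows, under stated assumptions: the class PooledInput, its fields, and the use of a ConcurrentLinkedQueue as the pool are all hypothetical and are not taken from Geode's source. The point is only that the buffer reference is taken and cleared under one lock, so exactly one caller ever returns it to the pool.

```java
import java.nio.ByteBuffer;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration; not Geode's Connection class.
public class PooledInput {
  private final Queue<ByteBuffer> pool;
  private final Object releaseLock = new Object();
  private ByteBuffer inputBuffer; // guarded by releaseLock

  public PooledInput(Queue<ByteBuffer> pool, ByteBuffer inputBuffer) {
    this.pool = pool;
    this.inputBuffer = inputBuffer;
  }

  /** Safe to call from several threads; the buffer is pooled at most once. */
  public void releaseInputBuffer() {
    ByteBuffer toRelease;
    synchronized (releaseLock) {
      // Take the reference and clear the field in one atomic step.
      toRelease = inputBuffer;
      inputBuffer = null;
    }
    if (toRelease != null) {
      pool.add(toRelease);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Queue<ByteBuffer> pool = new ConcurrentLinkedQueue<>();
    PooledInput conn = new PooledInput(pool, ByteBuffer.allocate(16));
    // Stand-ins for the closing thread and the reader thread.
    Thread closer = new Thread(conn::releaseInputBuffer);
    Thread reader = new Thread(conn::releaseInputBuffer);
    closer.start();
    reader.start();
    closer.join();
    reader.join();
    System.out.println("buffers in pool: " + pool.size());
  }
}
```

Without the lock, both threads could read the same non-null reference before either cleared it, and the pool would later hand the same buffer to two users.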
[jira] [Resolved] (GEODE-8689) CI Failure: DistributedAckPersistentRegionCCEDUnitTest > testConcurrentEventsOnEmptyRegion FAILED
[ https://issues.apache.org/jira/browse/GEODE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sarah Abbey resolved GEODE-8689. Resolution: Duplicate > CI Failure: DistributedAckPersistentRegionCCEDUnitTest > > testConcurrentEventsOnEmptyRegion FAILED > - > > Key: GEODE-8689 > URL: https://issues.apache.org/jira/browse/GEODE-8689 > Project: Geode > Issue Type: Bug >Reporter: Sarah Abbey >Priority: Major > > https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK11/builds/574 > {code:java} > org.awaitility.core.ConditionTimeoutException: Condition with alias 'Wait for > the members to eventually be consistent' didn't complete within 5 minutes > because assertion condition defined as a lambda expression in > org.apache.geode.cache30.MultiVMRegionTestCase [region contents are not > consistent for cckey7] expected:<"ccvalue513398912"> but was:. > at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:165) > at > org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119) > at > org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31) > at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:895) > at > org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:679) > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7472) DistributedAckOverflowRegionCCEOffHeapDUnitTest.testConcurrentEventsOnEmptyRegion failed in DistributedTestOpenJDK8
[ https://issues.apache.org/jira/browse/GEODE-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227088#comment-17227088 ] Sarah Abbey commented on GEODE-7472: Re-opening issue due to CI failure: https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK11/builds/574 > DistributedAckOverflowRegionCCEOffHeapDUnitTest.testConcurrentEventsOnEmptyRegion > failed in DistributedTestOpenJDK8 > --- > > Key: GEODE-7472 > URL: https://issues.apache.org/jira/browse/GEODE-7472 > Project: Geode > Issue Type: Bug > Components: tests >Affects Versions: 1.12.0 >Reporter: Mark Hanson >Assignee: Ernest Burghardt >Priority: Major > Labels: flaky > Fix For: 1.12.0 > > > testConcurrentEvents is failing in testing. > > [https://concourse.apachegeode-ci.info/teams/main/pipelines/mhansonp-mhanson-mass-test-ru-main/jobs/DistributedTestOpenJDK8/builds/1680] > {noformat} > org.apache.geode.cache30.DistributedAckOverflowRegionCCEOffHeapDUnitTest > > testConcurrentEventsOnEmptyRegion FAILED > org.awaitility.core.ConditionTimeoutException: Condition with alias 'Wait > for the members to eventually be consistent' didn't complete within 300 > seconds because assertion condition defined as a lambda expression in > org.apache.geode.cache30.MultiVMRegionTestCase that uses > org.apache.geode.test.dunit.VM, > org.apache.geode.test.dunit.VMorg.apache.geode.test.dunit.VM, > org.apache.geode.test.dunit.VMorg.apache.geode.test.dunit.VM [r2 contents are > not consistent with r1 for cckey7] expected:<"ccvalue1212233561"> but > was:. 
> Caused by: > org.junit.ComparisonFailure: [r2 contents are not consistent with r1 > for cckey7] expected:<"ccvalue1212233561"> but was: > {noformat} > > {noformat} > =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > http://files.apachegeode-ci.info/builds/mhansonp-mhanson-mass-test-ru-main/1.10.0-SNAPSHOT.0007/test-results/distributedTest/1574073134/ > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > Test report artifacts from this job are available at: > http://files.apachegeode-ci.info/builds/mhansonp-mhanson-mass-test-ru-main/1.10.0-SNAPSHOT.0007/test-artifacts/1574073134/distributedtestfiles-OpenJDK8-1.10.0-SNAPSHOT.0007.tgz{noformat} > [https://concourse.apachegeode-ci.info/teams/main/pipelines/mhansonp-mhanson-mass-test-ru-main/jobs/DistributedTestOpenJDK8/builds/1613] > {noformat} > org.apache.geode.cache30.DistributedAckOverflowRegionCCEOffHeapDUnitTest > > testConcurrentEventsOnEmptyRegion FAILED > 22:20:16org.awaitility.core.ConditionTimeoutException: Condition with > alias 'Wait for the members to eventually be consistent' didn't complete > within 300 seconds because assertion condition defined as a lambda expression > in org.apache.geode.cache30.MultiVMRegionTestCase that uses > org.apache.geode.test.dunit.VM, > org.apache.geode.test.dunit.VMorg.apache.geode.test.dunit.VM, > org.apache.geode.test.dunit.VMorg.apache.geode.test.dunit.VM [r2 contents are > not consistent with r1 for subkey cckey3-1] expected:<"ccvalue[-1702102599]"> > but was:<"ccvalue[2145556138]">. 
> 22:20:16 > 22:20:16Caused by: > 22:20:16org.junit.ComparisonFailure: [r2 contents are not consistent > with r1 for subkey cckey3-1] expected:<"ccvalue[-1702102599]"> but > was:<"ccvalue[2145556138]"> > 23:12:55 {noformat} > > {noformat} > =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > http://files.apachegeode-ci.info/builds/mhansonp-mhanson-mass-test-ru-main/1.10.0-SNAPSHOT.0007/test-results/distributedTest/1573975176/ > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > Test report artifacts from this job are available at: > http://files.apachegeode-ci.info/builds/mhansonp-mhanson-mass-test-ru-main/1.10.0-SNAPSHOT.0007/test-artifacts/1573975176/distributedtestfiles-OpenJDK8-1.10.0-SNAPSHOT.0007.tgz{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (GEODE-7472) DistributedAckOverflowRegionCCEOffHeapDUnitTest.testConcurrentEventsOnEmptyRegion failed in DistributedTestOpenJDK8
[ https://issues.apache.org/jira/browse/GEODE-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sarah Abbey reassigned GEODE-7472: -- Assignee: (was: Ernest Burghardt) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-7472) DistributedAckOverflowRegionCCEOffHeapDUnitTest.testConcurrentEventsOnEmptyRegion failed in DistributedTestOpenJDK8
[ https://issues.apache.org/jira/browse/GEODE-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227089#comment-17227089 ] Geode Integration commented on GEODE-7472: -- Seen in [DistributedTestOpenJDK11 #574|https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK11/builds/574] ... see [test results|http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0467/test-results/distributedTest/1604579602/] or download [artifacts|http://files.apachegeode-ci.info/builds/apache-develop-main/1.14.0-build.0467/test-artifacts/1604579602/distributedtestfiles-OpenJDK11-1.14.0-build.0467.tgz]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8681) peer-to-peer message loss due to sending connection closing with TLS enabled
[ https://issues.apache.org/jira/browse/GEODE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227092#comment-17227092 ] ASF GitHub Bot commented on GEODE-8681: --- echobravopapa closed pull request #5706: URL: https://github.com/apache/geode/pull/5706 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > peer-to-peer message loss due to sending connection closing with TLS enabled > > > Key: GEODE-8681 > URL: https://issues.apache.org/jira/browse/GEODE-8681 > Project: Geode > Issue Type: Bug > Components: membership, messaging >Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0 >Reporter: Bruce J Schuchardt >Assignee: Bruce J Schuchardt >Priority: Major > Labels: pull-request-available, release-blocker > > We have observed message loss when TLS is enabled and a distributed lock is > released right after sending a message that doesn't require acknowledgement > if the sending socket is immediately closed. The closing of sockets > immediately after sending a message is frequently seen in function execution > threads or server-side application threads that use this pattern: > {code:java} > try { > DistributedSystem.setThreadsSocketPolicy(false); > acquireDistributedLock(lockName); > (perform one or more cache operations) > } finally { > distLockService.unlock(lockName); > DistributedSystem.releaseThreadsSockets(); // closes the socket > } > {code} > The fault seems to be in NioSSLEngine.unwrap(), which throws an > SSLException() if it finds the SSLEngine is closed even though there is valid > data in its decrypt buffer. It shouldn't throw an exception in that case. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8688) Flaky C++ Native client integration test cases: PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop
[ https://issues.apache.org/jira/browse/GEODE-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227093#comment-17227093 ] ASF GitHub Bot commented on GEODE-8688: --- pivotal-jbarrett commented on a change in pull request #686: URL: https://github.com/apache/geode-native/pull/686#discussion_r518456823 ## File path: cppcache/integration/test/PartitionRegionOpsTest.cpp ## @@ -144,6 +146,10 @@ void verifyMetadataWasRemovedAtFirstError() { } } } + std::cout << "timeoutErrors: " << timeoutErrors << ", ioErrors: " << ioErrors +<< ", metadataRemovedDueToTimeout: " << metadataRemovedDueToTimeout +<< ", metadataRemovedDueToIoErr: " << metadataRemovedDueToIoErr +<< std::endl; Review comment: Tests should never output anything other than assertion failures. If more information is useful to log then it is useful to assert. > Flaky C++ Native client integration test cases: > PartitionRegionOpsTest.[get|put]PartitionedRegionWithRedundancyServerGoesDownSingleHop > -- > > Key: GEODE-8688 > URL: https://issues.apache.org/jira/browse/GEODE-8688 > Project: Geode > Issue Type: Bug > Components: native client >Affects Versions: 1.13.0 >Reporter: Alberto Gomez >Assignee: Alberto Gomez >Priority: Major > Labels: pull-request-available > > The following test cases for the C++ native client are flaky: > PartitionRegionOpsTest.getPartitionedRegionWithRedundancyServerGoesDownSingleHop > PartitionRegionOpsTest.putPartitionedRegionWithRedundancyServerGoesDownSingleHop > > They fail very often when run in CI although I have not seen them fail when > executed manually. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8681) peer-to-peer message loss due to sending connection closing with TLS enabled
[ https://issues.apache.org/jira/browse/GEODE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227094#comment-17227094 ] ASF GitHub Bot commented on GEODE-8681: --- echobravopapa opened a new pull request #5713: URL: https://github.com/apache/geode/pull/5713 …ng with TLS enabled (#5699) A socket-read could pick up more than one message and a single unwrap() could decrypt multiple messages. Normally the engine isn't closed and it reports normal status from an unwrap() operation, and Connection.processInputBuffer picks up each message, one by one, from the buffer and dispatches them. But if the SSLEngine is closed we were ignoring any already-decrypted data sitting in the unwrapped buffer and instead we were throwing an SSLException. (cherry picked from commit 7da8f9b516ac1e2525a1dfc922af7bfb8995f2c6) -- This message was sent by Atlassian Jira (v8.3.4#803005)
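The behaviour described for GEODE-8681 can be sketched roughly as follows. This is an illustration only, not the actual NioSslEngine code: the class UnwrapSketch and its methods are hypothetical, and the sketch assumes the decrypted buffer is in write mode, so that position() > 0 means already-decrypted bytes are still waiting to be dispatched.

```java
import java.nio.ByteBuffer;
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLException;

// Hypothetical helper; names are illustrative, not Geode's API.
final class UnwrapSketch {

  // A closed engine is an error only when no decrypted data is left to deliver.
  static boolean shouldThrow(SSLEngineResult.Status status, ByteBuffer unwrapped) {
    return status == SSLEngineResult.Status.CLOSED && unwrapped.position() == 0;
  }

  // Returns the decrypted buffer to the caller, or throws if the engine
  // is closed and there is nothing left to process.
  static ByteBuffer afterUnwrap(SSLEngineResult.Status status, ByteBuffer unwrapped)
      throws SSLException {
    if (shouldThrow(status, unwrapped)) {
      throw new SSLException("SSLEngine is closed and no decrypted data remains");
    }
    return unwrapped;
  }
}
```

With this check, a message that was decrypted in the same unwrap() call that observed the close is still handed to the caller instead of being lost.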
[jira] [Commented] (GEODE-8667) Duplicate online Oplog compaction after offline Oplog compaction
[ https://issues.apache.org/jira/browse/GEODE-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227100#comment-17227100 ] Jianxia Chen commented on GEODE-8667: - When Oplog.totalCount == 0, no Oplog compaction is needed. > Duplicate online Oplog compaction after offline Oplog compaction > > > Key: GEODE-8667 > URL: https://issues.apache.org/jira/browse/GEODE-8667 > Project: Geode > Issue Type: Bug >Reporter: Jianxia Chen >Assignee: Jianxia Chen >Priority: Major > Labels: GeodeOperationAPI, pull-request-available > > Use `compact offline-disk-store` command to compact the Oplogs offline. > Then restart the servers. > The logs show OplogCompactor thread is compacting Oplogs again when > restarting the servers, even though the Oplogs were just compacted offline. > For example: > ``` > [info 2020/10/27 16:32:22.534 PDT tid=0x35] Recovered > values for disk store DEFAULT with unique id > 76393d3c-dd10-4b89-b655-821d37631774 > [info 2020/10/27 16:32:22.535 PDT > tid=0x35] OplogCompactor for DEFAULT compaction oplog id(s): oplog#2 > [info 2020/10/27 16:32:22.537 PDT > tid=0x35] compaction did 2 creates and updates in 2 ms > [info 2020/10/27 16:32:22.537 PDT tid=0x36] Deleted > oplog#2 crf for disk store DEFAULT. > [info 2020/10/27 16:32:22.538 PDT tid=0x36] Deleted > oplog#2 krf for disk store DEFAULT. > [info 2020/10/27 16:32:22.538 PDT tid=0x36] Deleted > oplog#2 drf for disk store DEFAULT. > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227102#comment-17227102 ] ASF GitHub Bot commented on GEODE-8652: --- Bill merged pull request #5712: URL: https://github.com/apache/geode/pull/5712 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > member hung in Connection.notifyHandshakeWaiter() during disconnect waiting > for a lock held by another thread in Connection.readAck() > -- > > Key: GEODE-8652 > URL: https://issues.apache.org/jira/browse/GEODE-8652 > Project: Geode > Issue Type: Bug > Components: membership, messaging >Affects Versions: 1.12.0, 1.13.0, 1.14.0 >Reporter: Bill Burcham >Assignee: Bill Burcham >Priority: Major > Labels: pull-request-available > Fix For: 1.12.1, 1.14.0, 1.13.1 > > > An application encountered the following hang in a TLS-enabled cluster. > Let's call the cluster members ds3 -> ds1. > ds3 sends a {{PutAllPRMessage}} to ds1 and is stuck in > {{SocketChannel.read()}} waiting for the acknowledgement. > {{ClusterDistributionManager.uncleanShutdown()}} is invoked on ds3 to shut > the member down. That thread blocks trying to acquire a lock on the > {{NioSslEngine}} held by the first thread (the one doing waiting for the ack > to the put-all.) > Somehow the shutdown thread must be allowed to proceed. > Here's the hung thread in ds3 (a.k.a. 
vm_3_thr_34_dataStore3_host1_8592) > trying to shut down the member but it's stuck waiting for the monitor on the > {{NioSslEngine}}: > {noformat} > "vm_3_thr_34_dataStore3_host1_8592" #795 daemon prio=5 os_prio=0 > tid=0x7fdb9c011000 nid=0x2fcc waiting for monitor entry > [0x7fdb6f4b7000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.geode.internal.tcp.Connection.notifyHandshakeWaiter(Connection.java:804) > - waiting to lock <0xf2635b28> (a > org.apache.geode.internal.net.NioSslEngine) > at org.apache.geode.internal.tcp.Connection.close(Connection.java:1350) > at > org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1278) > at > org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612) > at > org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604) > at > org.apache.geode.internal.tcp.ConnectionTable.close(ConnectionTable.java:661) > - locked <0xf2678cf8> (a java.util.ArrayList) > - locked <0xf1187348> (a java.util.concurrent.ConcurrentHashMap) > at org.apache.geode.internal.tcp.TCPConduit.stop(TCPConduit.java:487) > at > org.apache.geode.distributed.internal.direct.DirectChannel.disconnect(DirectChannel.java:644) > - locked <0xf11867a8> (a > org.apache.geode.distributed.internal.direct.DirectChannel) > at > org.apache.geode.distributed.internal.DistributionImpl.disconnectDirectChannel(DistributionImpl.java:631) > at > org.apache.geode.distributed.internal.DistributionImpl.access$200(DistributionImpl.java:82) > at > org.apache.geode.distributed.internal.DistributionImpl$LifecycleListenerImpl.disconnect(DistributionImpl.java:904) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.stop(GMSMembership.java:1908) > at > org.apache.geode.distributed.internal.membership.gms.Services.stop(Services.java:302) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.shutdown(GMSMembership.java:1262) > at > 
org.apache.geode.distributed.internal.membership.gms.GMSMembership.disconnect(GMSMembership.java:1847) > at > org.apache.geode.distributed.internal.DistributionImpl.disconnect(DistributionImpl.java:501) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.uncleanShutdown(ClusterDistributionManager.java:1291) > {noformat} > That thread is waiting on a lock held by this thread (in ds3) which is > waiting on an acknowledgement to a PutAllPRMessage sent to ds1. > {noformat} > "vm_3_thr_37_dataStore3_host1_8592" #857 daemon prio=5 os_prio=0 > tid=0x7fdb9c030800 nid=0x30d1 runnable [0x7fdb732f] >java.lang.Thread.State: RUNNABLE > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227103#comment-17227103 ] ASF subversion and git services commented on GEODE-8652: Commit af267c005a63317cbb8528cdb38eccf6a8747818 in geode's branch refs/heads/develop from Bill Burcham [ https://gitbox.apache.org/repos/asf?p=geode.git;h=af267c0 ] * GEODE-8652: NioSslEngine.close() Bypasses Locks (#5712) - NioSslEngine.close() proceeds even if readers (or writers) are operating on its ByteBuffers, allowing Connection.close() to close its socket and proceed. - NioSslEngine.close() needed a lock only on the output buffer, so we split what was a single lock into two. Also instead of using synchronized we use a ReentrantLock so we can call tryLock() and time out if needed in NioSslEngine.close(). - Since readers/writers may hold locks on these input/output buffers when NioSslEngine.close() is called a reference count is maintained and the buffers are returned to the pool only when the last user is done. - To manage the locking and reference counting a new AutoCloseable ByteBufferSharing interface is introduced with a trivial implementation: ByteBufferSharingNoOp and a real implementation: ByteBufferSharingImpl. - Added a new unit test, and a new concurrency test for ByteBufferSharingImpl: both ensure that ByteBuffers are returned to the pool exactly once. Added a new DUnit test for the interaction between ByteBufferSharingImpl and NioSslEngine and Connection. Co-authored-by: Bill Burcham Co-authored-by: Darrel Schneider Co-authored-by: Ernie Burghardt Co-authored-by: Dan Smith
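The locking and reference counting described in the GEODE-8652 commit message can be sketched along these lines. This is a simplified illustration and not the real ByteBufferSharingImpl: the class SharedBufferSketch, its Lease type, and the returnToPool callback are hypothetical names, and the timed tryLock() that the commit mentions for NioSslEngine.close() is left out for brevity.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical sketch of a reference-counted, lock-protected buffer.
final class SharedBufferSketch {
  private final ByteBuffer buffer;
  private final Consumer<ByteBuffer> returnToPool;
  private final ReentrantLock lock = new ReentrantLock();
  // Starts at 1: the owner (for example, the engine) holds one reference.
  private final AtomicInteger refCount = new AtomicInteger(1);
  private final AtomicBoolean ownerDone = new AtomicBoolean(false);

  SharedBufferSketch(ByteBuffer buffer, Consumer<ByteBuffer> returnToPool) {
    this.buffer = buffer;
    this.returnToPool = returnToPool;
  }

  // A lease gives exclusive access to the buffer until it is closed.
  final class Lease implements AutoCloseable {
    ByteBuffer buffer() {
      return SharedBufferSketch.this.buffer;
    }

    @Override
    public void close() {
      lock.unlock();
      dropReference();
    }
  }

  // Readers and writers call this; they take a reference first, then the lock.
  Lease open() {
    addReference();
    lock.lock();
    return new Lease();
  }

  // The owner calls this on close. It does not wait for the lock, so a
  // blocked reader cannot stall shutdown; repeated calls have no effect.
  void destruct() {
    if (ownerDone.compareAndSet(false, true)) {
      dropReference();
    }
  }

  private void addReference() {
    while (true) {
      int current = refCount.get();
      if (current == 0) {
        throw new IllegalStateException("buffer already returned to pool");
      }
      if (refCount.compareAndSet(current, current + 1)) {
        return;
      }
    }
  }

  // The buffer goes back to the pool exactly once, when the last user is done.
  private void dropReference() {
    if (refCount.decrementAndGet() == 0) {
      returnToPool.accept(buffer);
    }
  }
}
```

A reader would typically use it as `try (SharedBufferSketch.Lease lease = shared.open()) { ... }`, so the lock is released and the reference dropped even if the read fails.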
[jira] [Commented] (GEODE-8652) member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck()
[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227104#comment-17227104 ]

ASF subversion and git services commented on GEODE-8652:

Commit af267c005a63317cbb8528cdb38eccf6a8747818 in geode's branch refs/heads/develop from Bill Burcham
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=af267c0 ]

* GEODE-8652: NioSslEngine.close() Bypasses Locks (#5712)

- NioSslEngine.close() proceeds even if readers (or writers) are operating on its ByteBuffers, allowing Connection.close() to close its socket and proceed.
- NioSslEngine.close() needed a lock only on the output buffer, so we split what was a single lock into two. Also instead of using synchronized we use a ReentrantLock so we can call tryLock() and time out if needed in NioSslEngine.close().
- Since readers/writers may hold locks on these input/output buffers when NioSslEngine.close() is called a reference count is maintained and the buffers are returned to the pool only when the last user is done.
- To manage the locking and reference counting a new AutoCloseable ByteBufferSharing interface is introduced with a trivial implementation: ByteBufferSharingNoOp and a real implementation: ByteBufferSharingImpl.
- Added a new unit test, and a new concurrency test for ByteBufferSharingImpl: both ensure that ByteBuffers are returned to the pool exactly once. Added a new DUnit test for the interaction between ByteBufferSharingImpl and NioSslEngine and Connection.
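The locking and reference-counting idea described in the GEODE-8652 commit message can be illustrated with a minimal sketch. This is not Geode's ByteBufferSharingImpl: the class name `SharedBufferSketch`, the methods `open`, `tryOpen` and `destroy`, and the `returnToPool` callback are all hypothetical, chosen only to show how a ReentrantLock with tryLock() plus a reference count can let a closer proceed while the buffer is still in use, and still hand the buffer back exactly once.

```java
import java.nio.ByteBuffer;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

/** Hypothetical illustration only; not the Geode implementation. */
public class SharedBufferSketch {
  private final ByteBuffer buffer;
  private final Consumer<ByteBuffer> returnToPool; // assumed pool callback
  private final ReentrantLock lock = new ReentrantLock();
  // Starts at 1: the owner holds the initial reference.
  private final AtomicInteger refCount = new AtomicInteger(1);
  private final AtomicBoolean destroyed = new AtomicBoolean(false);

  public SharedBufferSketch(ByteBuffer buffer, Consumer<ByteBuffer> returnToPool) {
    this.buffer = buffer;
    this.returnToPool = returnToPool;
  }

  /** Held by a reader or writer while it uses the buffer. */
  public final class Handle implements AutoCloseable {
    private boolean closed;

    public ByteBuffer getBuffer() {
      return buffer;
    }

    @Override
    public void close() {
      if (closed) {
        return; // closing twice must not release twice
      }
      closed = true;
      lock.unlock();
      dropReference();
    }
  }

  /** Blocks until the lock is free, then takes a reference. */
  public Handle open() {
    lock.lock();
    return takeReferenceOrUnlock();
  }

  /** Like open(), but gives up after the timeout instead of blocking forever. */
  public Optional<Handle> tryOpen(long time, TimeUnit unit) throws InterruptedException {
    if (!lock.tryLock(time, unit)) {
      return Optional.empty();
    }
    return Optional.of(takeReferenceOrUnlock());
  }

  /** Owner gives up its reference without waiting for the lock. */
  public void destroy() {
    if (destroyed.compareAndSet(false, true)) {
      dropReference();
    }
  }

  private Handle takeReferenceOrUnlock() {
    int current;
    do {
      current = refCount.get();
      if (current == 0) { // already returned to the pool
        lock.unlock();
        throw new IllegalStateException("buffer already returned to pool");
      }
    } while (!refCount.compareAndSet(current, current + 1));
    return new Handle();
  }

  private void dropReference() {
    if (refCount.decrementAndGet() == 0) {
      returnToPool.accept(buffer); // last user is done
    }
  }
}
```

In this sketch a reader would wrap its use of the buffer in try-with-resources on `open()`, while a shutdown path calls `destroy()` (or `tryOpen` with a timeout if it needs the buffer itself); the callback fires only when the last holder releases.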
[jira] [Commented] (GEODE-8681) peer-to-peer message loss due to sending connection closing with TLS enabled
[ https://issues.apache.org/jira/browse/GEODE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227105#comment-17227105 ]

ASF GitHub Bot commented on GEODE-8681:
---

echobravopapa opened a new pull request #5714:
URL: https://github.com/apache/geode/pull/5714

…ng with TLS enabled (#5699)

A socket-read could pick up more than one message and a single unwrap() could decrypt multiple messages. Normally the engine isn't closed and it reports normal status from an unwrap() operation, and Connection.processInputBuffer picks up each message, one by one, from the buffer and dispatches them. But if the SSLEngine is closed we were ignoring any already-decrypted data sitting in the unwrapped buffer and instead we were throwing an SSLException.

(cherry picked from commit 7da8f9b516ac1e2525a1dfc922af7bfb8995f2c6)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> peer-to-peer message loss due to sending connection closing with TLS enabled
>
> Key: GEODE-8681
> URL: https://issues.apache.org/jira/browse/GEODE-8681
> Project: Geode
> Issue Type: Bug
> Components: membership, messaging
> Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0
> Reporter: Bruce J Schuchardt
> Assignee: Bruce J Schuchardt
> Priority: Major
> Labels: pull-request-available, release-blocker
>
> We have observed message loss when TLS is enabled and a distributed lock is released right after sending a message that doesn't require acknowledgement if the sending socket is immediately closed. The closing of sockets immediately after sending a message is frequently seen in function execution threads or server-side application threads that use this pattern:
> {code:java}
> try {
>   DistributedSystem.setThreadsSocketPolicy(false);
>   acquireDistributedLock(lockName);
>   // perform one or more cache operations
> } finally {
>   distLockService.unlock(lockName);
>   DistributedSystem.releaseThreadsSockets(); // closes the socket
> }
> {code}
> The fault seems to be in NioSslEngine.unwrap(), which throws an SSLException if it finds the SSLEngine is closed even though there is valid data in its decrypt buffer. It shouldn't throw an exception in that case.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
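The behaviour the GEODE-8681 report asks for (hand over already-decrypted data before reporting that the engine is closed) can be sketched as follows. This is an illustration under stated assumptions, not the code in NioSslEngine: the class `DrainBeforeCloseSketch`, the `drain` method, the `engineClosed` flag, and the 4-byte length-prefixed framing are all hypothetical and do not reflect Geode's wire format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.SSLException;

/** Hypothetical illustration only; framing and names are assumptions. */
public class DrainBeforeCloseSketch {

  /**
   * Extracts every complete message from a buffer that is in read mode.
   * A closed engine is reported only when no decrypted message is left
   * to deliver, so data that arrived before the close is not lost.
   */
  public static List<String> drain(ByteBuffer decrypted, boolean engineClosed)
      throws SSLException {
    List<String> messages = new ArrayList<>();
    while (decrypted.remaining() >= Integer.BYTES) {
      decrypted.mark();
      int length = decrypted.getInt();
      if (length < 0) {
        throw new SSLException("corrupt frame length: " + length);
      }
      if (decrypted.remaining() < length) {
        decrypted.reset(); // incomplete message: leave it for the next read
        break;
      }
      byte[] payload = new byte[length];
      decrypted.get(payload);
      messages.add(new String(payload, StandardCharsets.UTF_8));
    }
    if (messages.isEmpty() && engineClosed) {
      throw new SSLException("engine is closed and no decrypted data remains");
    }
    return messages;
  }
}
```

The design point is the order of the checks: the closed state is consulted after the buffer has been drained, not before.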
[jira] [Commented] (GEODE-8466) Create a ClassLoaderService to abstract away dealing with the default ClassLoader directly
[ https://issues.apache.org/jira/browse/GEODE-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227111#comment-17227111 ]

ASF GitHub Bot commented on GEODE-8466:
---

lgtm-com[bot] commented on pull request #5658:
URL: https://github.com/apache/geode/pull/5658#issuecomment-722765604

This pull request **introduces 3 alerts** and **fixes 1** when merging 43000b9fa166477601cb64bb14dba9a7439e2c2d into af267c005a63317cbb8528cdb38eccf6a8747818 - [view on LGTM.com](https://lgtm.com/projects/g/apache/geode/rev/pr-8eea1ae86ef56849af481e8b8e63b58fa5cd0422)

**new alerts:**
* 2 for Potential input resource leak
* 1 for Use of a broken or risky cryptographic algorithm

**fixed alerts:**
* 1 for Use of a broken or risky cryptographic algorithm

> Create a ClassLoaderService to abstract away dealing with the default ClassLoader directly
>
> Key: GEODE-8466
> URL: https://issues.apache.org/jira/browse/GEODE-8466
> Project: Geode
> Issue Type: New Feature
> Components: core
> Reporter: Udo Kohlmeyer
> Assignee: Udo Kohlmeyer
> Priority: Major
> Labels: pull-request-available
>
> With the addition of ClassLoader isolation using JBoss Modules GEODE-8067, the manner in which we interact with the ClassLoader needs to change.
> An abstraction is required around the default functions like `findResourceAsStream`, `loadClass` and `loadService`.
> As these features will behave differently between different ClassLoader implementations, it is best to have a single service that will expose that functionality in a transparent manner.
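One possible shape for the abstraction requested in GEODE-8466 is sketched here. Only the three function names (`findResourceAsStream`, `loadClass`, `loadService`) come from the issue text; the interface signatures, the wrapper class `ClassLoaderServiceSketch`, and `DefaultClassLoaderService` are assumptions made for illustration and are not the API that was eventually committed to Geode.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

/** Hypothetical illustration only; signatures are assumptions. */
public class ClassLoaderServiceSketch {

  /** Single entry point so callers never touch a ClassLoader directly. */
  public interface ClassLoaderService {
    Optional<InputStream> findResourceAsStream(String resourcePath);

    Class<?> loadClass(String className) throws ClassNotFoundException;

    <T> List<T> loadService(Class<T> serviceType);
  }

  /** Simplest implementation: delegate everything to one ClassLoader. */
  public static final class DefaultClassLoaderService implements ClassLoaderService {
    private final ClassLoader delegate;

    public DefaultClassLoaderService(ClassLoader delegate) {
      this.delegate = delegate;
    }

    @Override
    public Optional<InputStream> findResourceAsStream(String resourcePath) {
      // getResourceAsStream returns null when the resource is missing
      return Optional.ofNullable(delegate.getResourceAsStream(resourcePath));
    }

    @Override
    public Class<?> loadClass(String className) throws ClassNotFoundException {
      return Class.forName(className, true, delegate);
    }

    @Override
    public <T> List<T> loadService(Class<T> serviceType) {
      List<T> providers = new ArrayList<>();
      for (T provider : ServiceLoader.load(serviceType, delegate)) {
        providers.add(provider);
      }
      return providers;
    }
  }
}
```

An isolated-module implementation (for example one backed by JBoss Modules) would implement the same interface, so calling code stays unchanged when the class loading strategy does.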
[jira] [Commented] (GEODE-8603) Potentially expand classes identified for CI stressing to include subclasses
[ https://issues.apache.org/jira/browse/GEODE-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227120#comment-17227120 ]

ASF subversion and git services commented on GEODE-8603:

Commit 61ad7515f2bae57ea8a4966604f8db5778daf99d in geode's branch refs/heads/support/1.13 from Jens Deppe
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=61ad751 ]

GEODE-8603: Potentially expand classes identified for CI stressing to include subclasses (#5601) (#5674)

- Make StressNewTestHelper create the complete gradle test task commands
- Since some tests may have subclasses in different source sets, (which would require a different repeat task name), it's easier for the command generation to all happen in the java helper rather than a combination of bash and java.
- Include candidate test class if it is not abstract
- Output a fake Gradle param so that scripts can determine the number of tests included.
- Change the CI stress job timeout from 6 to 10 hours.
- Increase the test count threshold from 25 to 35 changed tests. This number also includes any tests inferred by this new code.

(cherry picked from commit 4039a363a4b057bca322f29dcf33aa0664f1a912)

> Potentially expand classes identified for CI stressing to include subclasses
>
> Key: GEODE-8603
> URL: https://issues.apache.org/jira/browse/GEODE-8603
> Project: Geode
> Issue Type: Test
> Components: ci, tests
> Reporter: Jens Deppe
> Assignee: Jens Deppe
> Priority: Major
> Labels: pull-request-available
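The selection rule in the GEODE-8603 commit message (include a changed test class if it is not abstract, and also include its subclasses) might look roughly like the sketch below. This is not the StressNewTestHelper source: the class `StressCandidateSketch`, the `expand` method, and the idea of passing in a pre-collected list of known test classes are assumptions made for illustration.

```java
import java.lang.reflect.Modifier;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical illustration only; not the Geode helper. */
public class StressCandidateSketch {

  /**
   * Returns the concrete classes to stress: each changed class that is not
   * abstract, plus every non-abstract known class that extends or implements
   * a changed class.
   */
  public static Set<Class<?>> expand(Collection<Class<?>> changed,
      Collection<Class<?>> knownTests) {
    Set<Class<?>> result = new LinkedHashSet<>();
    for (Class<?> candidate : changed) {
      if (!isAbstract(candidate)) {
        result.add(candidate);
      }
      for (Class<?> other : knownTests) {
        boolean isSubclass = other != candidate && candidate.isAssignableFrom(other);
        if (isSubclass && !isAbstract(other)) {
          result.add(other);
        }
      }
    }
    return result;
  }

  private static boolean isAbstract(Class<?> type) {
    return Modifier.isAbstract(type.getModifiers());
  }
}
```

Because inferred subclasses count toward the changed-test threshold mentioned in the commit message, a caller would compare the size of the returned set, not the size of the input, against that limit.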
[jira] [Commented] (GEODE-8603) Potentially expand classes identified for CI stressing to include subclasses
[ https://issues.apache.org/jira/browse/GEODE-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227131#comment-17227131 ]

ASF subversion and git services commented on GEODE-8603:

Commit 9b2aea942d162f6ee43e3a7bcf8e654d5fbb9d3d in geode's branch refs/heads/support/1.12 from Jens Deppe
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=9b2aea9 ]

GEODE-8603: Potentially expand classes identified for CI stressing to include subclasses (#5601) (#5674)

- Make StressNewTestHelper create the complete gradle test task commands
- Since some tests may have subclasses in different source sets, (which would require a different repeat task name), it's easier for the command generation to all happen in the java helper rather than a combination of bash and java.
- Include candidate test class if it is not abstract
- Output a fake Gradle param so that scripts can determine the number of tests included.
- Change the CI stress job timeout from 6 to 10 hours.
- Increase the test count threshold from 25 to 35 changed tests. This number also includes any tests inferred by this new code.

(cherry picked from commit 4039a363a4b057bca322f29dcf33aa0664f1a912)

> Potentially expand classes identified for CI stressing to include subclasses
>
> Key: GEODE-8603
> URL: https://issues.apache.org/jira/browse/GEODE-8603
> Project: Geode
> Issue Type: Test
> Components: ci, tests
> Reporter: Jens Deppe
> Assignee: Jens Deppe
> Priority: Major
> Labels: pull-request-available
[jira] [Commented] (GEODE-8603) Potentially expand classes identified for CI stressing to include subclasses
[ https://issues.apache.org/jira/browse/GEODE-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227167#comment-17227167 ]

ASF GitHub Bot commented on GEODE-8603:
---

onichols-pivotal opened a new pull request #5717:
URL: https://github.com/apache/geode/pull/5717

It seems that devBuild is not sufficient to ensure that StressNewTestHelper and all tests are built before running the helper. Therefore, explicitly call `compileTestJava` to ensure that we do the needful regardless of branch.

> Potentially expand classes identified for CI stressing to include subclasses
>
> Key: GEODE-8603
> URL: https://issues.apache.org/jira/browse/GEODE-8603
> Project: Geode
> Issue Type: Test
> Components: ci, tests
> Reporter: Jens Deppe
> Assignee: Jens Deppe
> Priority: Major
> Labels: pull-request-available
[jira] [Commented] (GEODE-8603) Potentially expand classes identified for CI stressing to include subclasses
[ https://issues.apache.org/jira/browse/GEODE-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227202#comment-17227202 ]

ASF GitHub Bot commented on GEODE-8603:
---

onichols-pivotal merged pull request #5717:
URL: https://github.com/apache/geode/pull/5717

> Potentially expand classes identified for CI stressing to include subclasses
>
> Key: GEODE-8603
> URL: https://issues.apache.org/jira/browse/GEODE-8603
> Project: Geode
> Issue Type: Test
> Components: ci, tests
> Reporter: Jens Deppe
> Assignee: Jens Deppe
> Priority: Major
> Labels: pull-request-available
[jira] [Commented] (GEODE-8603) Potentially expand classes identified for CI stressing to include subclasses
[ https://issues.apache.org/jira/browse/GEODE-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227205#comment-17227205 ]

ASF subversion and git services commented on GEODE-8603:

Commit d1e003c822463e20ce43109e6dc3f6f72a110586 in geode's branch refs/heads/develop from Owen Nichols
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=d1e003c ]

GEODE-8603: fix StressNew for support branches (#5717)

* GEODE-8603: fix StressNew for support branches
* all three test compile targets are needed

> Potentially expand classes identified for CI stressing to include subclasses
>
> Key: GEODE-8603
> URL: https://issues.apache.org/jira/browse/GEODE-8603
> Project: Geode
> Issue Type: Test
> Components: ci, tests
> Reporter: Jens Deppe
> Assignee: Jens Deppe
> Priority: Major
> Labels: pull-request-available