[ https://issues.apache.org/jira/browse/GEODE-10330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nabarun Nag updated GEODE-10330:
--------------------------------
    Fix Version/s: 1.16.0

> Resource issues lead to "MemberDisconnectedException: Member isn't responding to heartbeat requests"
> ----------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-10330
>                 URL: https://issues.apache.org/jira/browse/GEODE-10330
>             Project: Geode
>          Issue Type: Bug
>    Affects Versions: 1.16.0
>            Reporter: Donal Evans
>            Assignee: Nabarun Nag
>            Priority: Major
>              Labels: needsTriage
>             Fix For: 1.16.0
>
>
> A failure was observed in DistributedMulticastRegionWithUDPSecurityDUnitTest > testMulticastAfterReconnect due to suspect strings with fatal-level logging of "Membership service failure: Member isn't responding to heartbeat requests".
> Investigating the logs showed all members reporting long statistics sampling wakeup delays, indicating resource issues:
> {code:java}
> [vm3] [warn 2022/05/21 07:28:16.251 UTC LocatorWithMcast <StatSampler> tid=0xb8] Statistics sampling thread detected a wakeup delay of 4760 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
> ...
> [locator] [warn 2022/05/21 07:28:20.288 UTC <StatSampler> tid=0x3b] Statistics sampling thread detected a wakeup delay of 12400 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
> ...
> [vm1] [warn 2022/05/21 07:28:20.969 UTC vm1 <StatSampler> tid=0xda] Statistics sampling thread detected a wakeup delay of 13738 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
> ...
> [vm0] [warn 2022/05/21 07:28:22.226 UTC vm0 <StatSampler> tid=0xa9] Statistics sampling thread detected a wakeup delay of 15110 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
> {code}
>
> After downloading the test artifacts and using the progress tool from the dev-tools directory in the Geode repository, the following tests were found to be running during the resource issues, possibly indicating that one or more of them are particularly resource-intensive:
> {noformat}
> $> progress -r '2022-05-21 07:28:16.251 -0000' | grep org | sort{noformat}
> {code:java}
> org.apache.geode.cache.PRCacheListenerWithInterestPolicyAllDistributedTest.afterUpdateIsInvokedInEveryMember[0: redundancy=0]
> org.apache.geode.cache.lucene.LuceneQueriesReindexDUnitTest.recreateIndexWithDifferentFieldsShouldFail(PARTITION_OVERFLOW_TO_DISK) [2]
> org.apache.geode.cache.query.cq.dunit.CqDataUsingPoolOptimizedExecuteDUnitTest.testCQHAWithState
> org.apache.geode.cache.query.cq.dunit.PartitionedRegionCqQueryDUnitTest.testPartitionedCqOnAccessorBridgeServer
> org.apache.geode.cache30.CallbackArgDUnitTest.testForCA
> org.apache.geode.cache30.DistributedMulticastRegionWithUDPSecurityDUnitTest.testMulticastAfterReconnect
> org.apache.geode.cache30.DistributedNoAckRegionCCEOffHeapDUnitTest.testDistributedInvalidate
> org.apache.geode.cache30.GlobalRegionOffHeapDUnitTest.testOrderedUpdates
> org.apache.geode.cache30.ReconnectWithClusterConfigurationDUnitTest.testReconnectAfterMeltdown
> org.apache.geode.distributed.internal.P2PMessagingConcurrencyDUnitTest.testP2PMessaging(true, false, 32768, 65536) [6]
> org.apache.geode.disttx.PRDistTXDUnitTest.testSimulaneousChildRegionCreation
> org.apache.geode.internal.cache.ClientServerTransactionCCEDUnitTest.testClientCommitFunctionWithFailure
> org.apache.geode.internal.cache.eviction.OffHeapEvictionStatsDUnitTest.testHeapLruCounter
> org.apache.geode.internal.cache.wan.concurrent.ConcurrentParallelGatewaySenderOperation_1_DUnitTest.testParallelPropagationSenderStartAfterStopOnAccessorNode
> org.apache.geode.internal.cache.wan.offheap.ParallelGatewaySenderOperationsOffHeapDistributedTest.testParallelGatewaySenderStartOnAccessorNode
> org.apache.geode.internal.cache.wan.serial.SerialWANPropagation_PartitionedRegionDUnitTest.testPartitionedSerialPropagationHA
> org.apache.geode.internal.tcp.TCPConduitDUnitTest.basicAcceptConnection[0]
> org.apache.geode.management.internal.configuration.ClusterConfigImportDUnitTest.importFailWithExistingRegion
> org.apache.geode.rest.internal.web.controllers.RestAPIsOnGroupsFunctionExecutionDUnitTest.testBasicP2PFunctionSelectedGroup[1]
> org.apache.geode.session.tests.Jetty9CachingClientServerTest.failureShouldStillAllowOtherContainersDataAccess
> org.apache.geode.session.tests.Tomcat8ClientServerCustomCacheXmlTest.containersShouldExpireInSetTimeframe
> org.apache.geode.session.tests.Tomcat8Test.containersShouldReplicateCookies
> org.apache.geode.session.tests.Tomcat9ClientServerTest.invalidationShouldRemoveValueAccessForAllContainers
> {code}
> Future failures due to this sort of resource issue should also list concurrently running tests so that repeat appearances by individual tests can be used to identify the culprits.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
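
The closing suggestion in the issue description, using repeat appearances across failures to narrow down the culprits, could be sketched as a small tally. This is only an illustrative sketch in Python and is not part of the Geode dev-tools: `base_name` and `repeat_offenders` are hypothetical helper names, and it assumes the list of tests running during each failure has already been collected (for example from `progress` output). The test names in the usage lines are made up.

```python
import re
from collections import Counter


def base_name(test):
    """Drop parameter suffixes such as '[0]' or '(true, false) [6]' (assumed naming style)."""
    return re.split(r"[\[(]", test, maxsplit=1)[0].strip()


def repeat_offenders(failures, min_count=2):
    """Return (test, count) pairs for tests seen in at least min_count failures.

    failures: one iterable of running test names per observed failure.
    """
    counts = Counter()
    for running in failures:
        # Count a test at most once per failure, ignoring parameter suffixes.
        counts.update({base_name(t) for t in running})
    hits = [(t, n) for t, n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical input: two failures, each with the tests running at the time.
example = [
    ["example.FooDUnitTest.testA[0]", "example.BarDUnitTest.testB"],
    ["example.FooDUnitTest.testA[1]", "example.BazDUnitTest.testC"],
]
for test, count in repeat_offenders(example):
    print(count, test)
```

Counting each test once per failure keeps a heavily parameterized test from dominating the tally just because several of its variants ran at the same moment.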