[ https://issues.apache.org/jira/browse/GEODE-9531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434502#comment-17434502 ]
Dale Emery commented on GEODE-9531:
-----------------------------------

I was curious about the warnings from the stat sampling thread, so I checked a bunch of runs with failures. Eight of those failure runs had swarms of those warnings. By "swarm" I mean that multiple tests issued that warning at about the same time (within a second or two).

In all eight of those runs, the following tests were executing at the time of the first warning:
# org.apache.geode.security.ClientAuthorizationCQDUnitTest testAllOpsWithFailover2
# org.apache.geode.management.GfshRebalanceCommandCompatibilityTest whenCurrentVersionLocatorsExecuteRebalanceOnOldServersThenItMustSucceed
# org.apache.geode.management.ConfigurationCompatibilityTest whenConfigurationIsExchangedBetweenMixedVersionLocatorsThenItShouldNotThrowExceptions
# org.apache.geode.cache.wan.WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover
# org.apache.geode.cache.wan.WANRollingUpgradeSecondaryEventsNotReprocessedAfterCurrentSiteMemberFailoverWithOldClient testSecondaryEventsNotReprocessedAfterCurrentSiteMemberFailoverWithOldClient
# org.apache.geode.cache.wan.WANRollingUpgradeEventProcessingOldSiteOneCurrentSiteTwo testEventProcessingOldSiteOneCurrentSiteTwo
# org.apache.geode.cache.wan.WANRollingUpgradeEventProcessingMixedSiteOneOldSiteTwo EventProcessingMixedSiteOneOldSiteTwo
# org.apache.geode.cache.wan.WANRollingUpgradeEventProcessingMixedSiteOneCurrentSiteTwo EventProcessingMixedSiteOneCurrentSiteTwo
# org.apache.geode.cache.wan.WANRollingUpgradeCreateGatewaySenderMixedSiteOneCurrentSiteTwo CreateGatewaySenderMixedSiteOneCurrentSiteTwo
# org.apache.geode.cache.lucene.RollingUpgradeReindexShouldBeSuccessfulWhenAllServersRollToCurrentVersion luceneReindexShouldBeSuccessfulWhenAllServersRollToCurrentVersion
# org.apache.geode.cache.lucene.RollingUpgradeQueryReturnsCorrectResultsAfterServersRollOverOnPersistentPartitionRegion luceneQueryReturnsCorrectResultsAfterServersRollOverOnPersistentPartitionRegion
# org.apache.geode.cache.lucene.RollingUpgradeQueryReturnsCorrectResultsAfterServersRollOverOnPartitionRegion luceneQueryReturnsCorrectResultsAfterServersRollOverOnPartitionRegion
# org.apache.geode.cache.lucene.RollingUpgradeQueryReturnsCorrectResultsAfterClientAndServersAreRolledOverAllBucketsCreated
# org.apache.geode.cache.lucene.RollingUpgradeQueryReturnsCorrectResultsAfterClientAndServersAreRolledOver luceneQueryReturnsCorrectResultsAfterClientAndServersAreRolledOver
# org.apache.geode.cache.lucene.RollingUpgradeQueryReturnsCorrectResultAfterTwoLocatorsWithTwoServersAreRolled luceneQueryReturnsCorrectResultAfterTwoLocatorsWithTwoServersAreRolled

Perhaps one of these tests is doing something unusually CPU-intensive. Given that most tests succeeded even after emitting the warning, I may be able to prune this list of tests by analyzing "green" jobs that have those warnings.

> CI Failure: TxCommitMessageBCClientToServerTxPartitionTest fails with ForcedDisconnectException
> -----------------------------------------------------------------------------------------------
>
> Key: GEODE-9531
> URL: https://issues.apache.org/jira/browse/GEODE-9531
> Project: Geode
> Issue Type: Bug
> Affects Versions: 1.14.0
> Reporter: Donal Evans
> Assignee: Eric Shu
> Priority: Major
> Labels: GeodeOperationAPI
>
> {noformat}
> org.apache.geode.internal.cache.TxCommitMessageBCClientToServerTxPartitionTest > test[11] FAILED
> org.apache.geode.test.dunit.RMIException: While invoking org.apache.geode.internal.cache.TxCommitMessageBCTestBase$$Lambda$55/2050040059.run in VM 2 running on Host 1797ac7f43c4 with 5 VMs
> Caused by:
> org.apache.geode.distributed.DistributedSystemDisconnectedException: membership shutdown, caused by org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
> Caused by:
> org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
> java.lang.AssertionError: Suspicious strings were written to the log during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> -----------------------------------------------------------------------
> Found suspect string in 'dunit_suspect-vm2.log' at line 993
> [fatal 2021/05/25 16:58:13.700 GMT <unicast receiver,1797ac7f43c4-36391> tid=1349] Membership service failure: Member isn't responding to heartbeat requests
>
> org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: Member isn't responding to heartbeat requests
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:1783)
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1122)
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveMemberMessage(GMSJoinLeave.java:725)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1366)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1302)
>     at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
>     at org.jgroups.JChannel.up(JChannel.java:741)
>     at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
>     at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
>     at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
>     at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
>     at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
>     at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:72)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:70)
>     at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
>     at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
>     at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
>     at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
>     at org.jgroups.protocols.TP.receive(TP.java:1714)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:159)
>     at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
>     at java.lang.Thread.run(Thread.java:748)
> -----------------------------------------------------------------------
> Found suspect string in 'dunit_suspect-vm2.log' at line 1041
> [error 2021/05/25 16:58:14.206 GMT <RMI TCP Connection(23)-172.17.0.39> tid=135] Cache initialization for GemFireCache[id = 664332017; isClosing = false; isShutDownAll = false; created = Tue May 25 16:57:54 GMT 2021; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because:
> org.apache.geode.distributed.DistributedSystemDisconnectedException: membership shutdown, caused by org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>     at org.apache.geode.distributed.internal.DistributionImpl.checkCancelled(DistributionImpl.java:313)
>     at org.apache.geode.distributed.internal.DistributionImpl.send(DistributionImpl.java:243)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.sendViaMembershipManager(ClusterDistributionManager.java:2053)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.sendOutgoing(ClusterDistributionManager.java:1981)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.sendMessage(ClusterDistributionManager.java:2018)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.putOutgoing(ClusterDistributionManager.java:1083)
>     at org.apache.geode.internal.cache.CreateRegionProcessor.initializeRegion(CreateRegionProcessor.java:115)
>     at org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1161)
>     at org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1092)
>     at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3104)
>     at org.apache.geode.internal.cache.InternalRegionFactory.create(InternalRegionFactory.java:78)
>     at org.apache.geode.pdx.internal.PeerTypeRegistration.initialize(PeerTypeRegistration.java:202)
>     at org.apache.geode.pdx.internal.TypeRegistry.initialize(TypeRegistry.java:116)
>     at org.apache.geode.internal.cache.GemFireCacheImpl.initializePdxRegistry(GemFireCacheImpl.java:1671)
>     at org.apache.geode.internal.cache.GemFireCacheImpl.initializeDeclarativeCache(GemFireCacheImpl.java:1605)
>     at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1448)
>     at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
>     at org.apache.geode.internal.cache.CacheFactoryStatics.create(CacheFactoryStatics.java:61)
>     at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:352)
>     at org.apache.geode.internal.cache.TxCommitMessageBCTestBase.createServerCacheWithPool(TxCommitMessageBCTestBase.java:187)
>     at org.apache.geode.internal.cache.TxCommitMessageBCTestBase.lambda$postSetUp$384cd611$1(TxCommitMessageBCTestBase.java:117)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.geode.test.dunit.internal.MethodInvoker.executeObject(MethodInvoker.java:123)
>     at org.apache.geode.test.dunit.internal.RemoteDUnitVM.executeMethodOnObject(RemoteDUnitVM.java:78)
>     at sun.reflect.GeneratedMethodAccessor333.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
>     at sun.rmi.transport.Transport$1.run(Transport.java:200)
>     at sun.rmi.transport.Transport$1.run(Transport.java:197)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
>     at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>     at org.apache.geode.distributed.internal.DistributionImpl.checkCancelled(DistributionImpl.java:312)
>     ... 42 more
> -----------------------------------------------------------------------
> Found suspect string in 'dunit_suspect-vm2.log' at line 1130
> [error 2021/05/25 16:58:15.274 GMT <RMI TCP Connection(23)-172.17.0.39> tid=135] org.apache.geode.distributed.DistributedSystemDisconnectedException: membership shutdown, caused by org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
> 576 tests completed, 1 failed, 36 skipped
> {noformat}
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-support-1-14-main/1.14.0-build.0787/test-results/upgradeTest/1621966586/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-support-1-14-main/1.14.0-build.0787/test-artifacts/1621966586/upgradetestfiles-OpenJDK8-1.14.0-build.0787.tgz

--
This message was sent by Atlassian Jira
(v8.3.4#803005)