Thanks for looking into this, Mario! You are probably right that the underlying issue was pre-existing and that the test is only surfacing it. I am glad you are investigating, though, because a fail rate of close to 30% is a problem. Something like this happens every once in a while, and then someone has to resolve historical problems that they hadn't planned to address.
Thanks!

On Tue, Jul 14, 2020 at 1:06 AM Mario Ivanac <mario.iva...@est.tech> wrote:

> Hi,
>
> after adding additional checks to the failing test, I can now see that the tests are failing because some batches are distributed while the GW sender is being stopped.
> Because of that, I suspect that this problem existed before this PR, but this PR is the first to introduce a test that checks for it.
>
> I will continue to investigate this fault, but I cannot reproduce it locally, which is slowing down troubleshooting.
>
> BR,
> Mario
> ________________________________
> From: Alexander Murmann <amurm...@apache.org>
> Sent: 14 July 2020 1:11
> To: Alexander Murmann <amurm...@apache.org>
> Cc: dev@geode.apache.org <dev@geode.apache.org>; Mario Ivanac <mario.iva...@est.tech>
> Subject: Re: [INFO] Latest test run of 200 DistributedTestOpenJDK8 passes
>
> We continue to see these WAN tests adding a fail rate of just below 30% in our mass test runs <https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/create-mass-test-run-report/builds/10>.
>
> That's a very significant fail rate that impacts our ability to get our code committed with confidence.
>
> Can we resolve this issue? Otherwise, I think we need to consider reverting GEODE-7458.
>
> On Fri, Jun 19, 2020 at 3:28 PM Alexander Murmann <amurm...@apache.org> wrote:
>
> > Looking more into this, it looks like this was introduced by the changes for GEODE-7458 - "Adding additional option in gfsh command "start gateway sender" to control clearing of existing queues".
> >
> > That happened about a month ago, but it is in the nature of flaky tests that we only discover them after a while. Nonetheless, they become paper cuts that ultimately slow us down substantially if they persist.
> >
> > @Mario Ivanac If I am correct and GEODE-7458 introduced this, you were the one who made that change.
> > Might you be able to take a look at making that test more reliable or reverting the change?
> >
> > Thank you!
> >
> > On Fri, Jun 19, 2020 at 7:57 AM Alexander Murmann <amurm...@apache.org> wrote:
> >
> >> Thank you so much for sharing this, Mark!
> >>
> >> It looks like there is a big cluster of failures around the WAN gateway. Is anyone already looking into the WAN issues?
> >>
> >> On Thu, Jun 18, 2020 at 10:06 PM Mark Hanson <hans...@vmware.com> wrote:
> >>
> >>> FYI, the build success rate was around 90% about two months ago.
> >>>
> >>> Here are the DUnit tests that are currently failing in our mass test runs and, most likely, in the CI and PR pipelines as well.
> >>>
> >>> Please let me know if you have any questions.
> >>>
> >>> Thanks,
> >>> Mark
> >>>
> >>> ***********************************************************************************
> >>>
> >>> Overall build success rate: 78.00000% (156 of 200)
> >>>
> >>> ***********************************************************************************
> >>>
> >>> The following test methods see failures in more than one class.
> >>> There may be a failing *TestBase class.
> >>>
> >>> *.testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived: 12 failures:
> >>>
> >>> SerialWANPersistenceEnabledGatewaySenderDUnitTest: 8 failures (96.000% success rate)
> >>>
> >>> SerialWANPersistenceEnabledGatewaySenderOffHeapDUnitTest: 4 failures (98.000% success rate)
> >>>
> >>> *.testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived: 12 failures:
> >>>
> >>> ParallelWANPersistenceEnabledGatewaySenderOffHeapDUnitTest: 5 failures (97.500% success rate)
> >>>
> >>> ParallelWANPersistenceEnabledGatewaySenderDUnitTest: 7 failures (96.500% success rate)
> >>>
> >>> *.testPingWrongServer: 4 failures:
> >>>
> >>> ClientServerMiscSelectorDUnitTest: 3 failures (98.500% success rate)
> >>>
> >>> ClientServerMiscDUnitTest: 1 failure (99.500% success rate)
> >>>
> >>> ***********************************************************************************
> >>>
> >>> org.apache.geode.internal.cache.wan.serial.SerialWANPersistenceEnabledGatewaySenderDUnitTest: 8 failures (96.000% success rate)
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3539
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3526
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3505
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3435
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3414
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3391
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3363
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3359
> >>>
> >>> org.apache.geode.management.MemberMXBeanDistributedTest: 2 failures (99.000% success rate)
> >>>
> >>> testBucketCount
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3463
> >>>
> >>> testBucketCount
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3411
> >>>
> >>> org.apache.geode.internal.cache.tier.sockets.RedundancyLevelPart3DUnitTest: 1 failure (99.500% success rate)
> >>>
> >>> testRegisterInterestAndMakePrimaryWithFullRedundancy
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3381
> >>>
> >>> org.apache.geode.management.internal.cli.commands.QueryCommandOverHttpDUnitTest: 1 failure (99.500% success rate)
> >>>
> >>> testSimpleQueryOnLocator
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3516
> >>>
> >>> org.apache.geode.internal.cache.tier.sockets.ClientServerMiscDUnitTest: 1 failure (99.500% success rate)
> >>>
> >>> testPingWrongServer
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3400
> >>>
> >>> org.apache.geode.internal.cache.wan.serial.SerialWANStatsDUnitTest: 3 failures (98.500% success rate)
> >>>
> >>> testReplicatedSerialPropagationWithGroupTransactionEventsSendsBatchesWithCompleteTransactions
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3495
> >>>
> >>> testReplicatedSerialPropagationWithGroupTransactionEventsSendsBatchesWithCompleteTransactions
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3470
> >>>
> >>> testReplicatedSerialPropagationHAWithGroupTransactionEvents
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3479
> >>>
> >>> org.apache.geode.distributed.internal.deadlock.GemFireDeadlockDetectorDUnitTest: 1 failure (99.500% success rate)
> >>>
> >>> testDistributedDeadlockWithDLock
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3421
> >>>
> >>> org.apache.geode.distributed.LocatorDUnitTest: 1 failure (99.500% success rate)
> >>>
> >>> testStartTwoLocatorsWithMultiKeystoreSSL
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3398
> >>>
> >>> org.apache.geode.internal.cache.wan.offheap.ParallelWANPersistenceEnabledGatewaySenderOffHeapDUnitTest: 5 failures (97.500% success rate)
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3531
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3522
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3456
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3439
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3405
> >>>
> >>> org.apache.geode.internal.cache.ClientServerTransactionFailoverWithMixedVersionServersDistributedTest: 1 failure (99.500% success rate)
> >>>
> >>> clientTransactionOperationsAreNotLostIfTransactionIsOnRolledServer
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3433
> >>>
> >>> org.apache.geode.internal.cache.wan.parallel.ParallelWANPersistenceEnabledGatewaySenderDUnitTest: 7 failures (96.500% success rate)
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3478
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3463
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3439
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3436
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3405
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3370
> >>>
> >>> testpersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3351
> >>>
> >>> org.apache.geode.internal.cache.partitioned.PersistentColocatedPartitionedRegionDistributedTest: 4 failures (98.000% success rate)
> >>>
> >>> testMultipleColocatedChildPRsMissingWithSequencedStart
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3493
> >>>
> >>> testMissingColocatedChildPRDueToDelayedStart
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3503
> >>>
> >>> testHierarchyOfColocatedChildPRsMissingGrandchild
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3537
> >>>
> >>> testHierarchyOfColocatedChildPRsMissingGrandchild
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3424
> >>>
> >>> org.apache.geode.distributed.DistributedMemberDUnitTest: 2 failures (99.000% success rate)
> >>>
> >>> testGroupsInAllVMs
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3423
> >>>
> >>> testGroupsInAllVMs
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3374
> >>>
> >>> org.apache.geode.management.JMXMBeanReconnectDUnitTest: 5 failures (97.500% success rate)
> >>>
> >>> serverMXBeansOnServerAreUnaffectedByLocatorCrash
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3503
> >>>
> >>> serverMXBeansAreRestoredOnBothLocatorsAfterCrashedServerReturns
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3436
> >>>
> >>> serverMXBeansAreRestoredOnBothLocatorsAfterCrashedServerReturns
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3406
> >>>
> >>> locatorHasMemberTypeMXBeansForBothLocators
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3352
> >>>
> >>> serverMXBeansOnLocatorAreRestoredAfterCrashedServerReturns
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3457
> >>>
> >>> org.apache.geode.internal.cache.wan.offheap.SerialWANPersistenceEnabledGatewaySenderOffHeapDUnitTest: 4 failures (98.000% success rate)
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3494
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3465
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3464
> >>>
> >>> testReplicatedRegionPersistentWanGateway_restartSenderWithCleanQueues_expectNoEventsReceived
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3379
> >>>
> >>> org.apache.geode.internal.cache.tier.sockets.ClientServerMiscSelectorDUnitTest: 3 failures (98.500% success rate)
> >>>
> >>> testPingWrongServer
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3402
> >>>
> >>> testPingWrongServer
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3380
> >>>
> >>> testPingWrongServer
> >>> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-mass-test-run-main/jobs/DistributedTestOpenJDK8/builds/3357
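The per-class percentages in the report match passing runs divided by the 200 runs in the mass test run. For example, 8 failures gives 192/200 = 96.000%. The script below is a hypothetical helper, not part of Geode's CI tooling. It applies that formula to the four WAN classes behind the two restartSenderWithCleanQueues test methods. It also counts distinct builds, because builds 3439 and 3405 appear under both parallel WAN classes. The build numbers are copied from the report. The names `WAN_FAILURES`, `success_rate`, and `distinct_failed_builds` are illustrative assumptions.

```python
from __future__ import annotations

# Size of the mass test run, taken from the report header ("156 of 200").
TOTAL_RUNS = 200

# Failing build numbers per class, copied from the report for the two WAN
# test methods ending in _restartSenderWithCleanQueues_expectNoEventsReceived.
WAN_FAILURES = {
    "SerialWANPersistenceEnabledGatewaySenderDUnitTest":
        [3539, 3526, 3505, 3435, 3414, 3391, 3363, 3359],
    "SerialWANPersistenceEnabledGatewaySenderOffHeapDUnitTest":
        [3494, 3465, 3464, 3379],
    "ParallelWANPersistenceEnabledGatewaySenderOffHeapDUnitTest":
        [3531, 3522, 3456, 3439, 3405],
    "ParallelWANPersistenceEnabledGatewaySenderDUnitTest":
        [3478, 3463, 3439, 3436, 3405, 3370, 3351],
}


def success_rate(failures: int, runs: int = TOTAL_RUNS) -> float:
    """Passing runs divided by total runs, in percent (as printed in the report)."""
    return 100.0 * (runs - failures) / runs


def distinct_failed_builds(failures_by_class: dict[str, list[int]]) -> set[int]:
    """Union of build numbers; a build that failed in two classes counts once."""
    builds: set[int] = set()
    for build_numbers in failures_by_class.values():
        builds.update(build_numbers)
    return builds


if __name__ == "__main__":
    for cls, builds in WAN_FAILURES.items():
        print(f"{cls}: {len(builds)} failures "
              f"({success_rate(len(builds)):.3f}% success rate)")
    affected = distinct_failed_builds(WAN_FAILURES)
    total = sum(len(b) for b in WAN_FAILURES.values())
    print(f"{total} WAN failures in {len(affected)} distinct builds "
          f"({100.0 * len(affected) / TOTAL_RUNS:.1f}% of {TOTAL_RUNS} runs)")
```

Run as a script, it reproduces the per-class rates from the report (96.000%, 98.000%, 97.500%, 96.500%). It also shows that the 24 WAN failures fall in 22 distinct builds, or 11.0% of the 200 runs. With 44 failed builds overall (156 of 200 passed), that is half of the failed builds by this count. Some of those builds also failed for other reasons, for example 3463 (testBucketCount) and 3436 (JMXMBeanReconnectDUnitTest).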