One other thing to think about is perhaps having a rotating team to deal with 
flaky tests,  a small team commissioned every three or 6 months to clear out 
flaky tests for 1 month. It is good experience

Thanks,
Mark

On 6/9/21, 10:04 AM, "Mark Hanson" <hans...@vmware.com> wrote:

    One other thing is that we have code owners. If the PR submitter decides to 
ignore the stress new test, the code owner can still request a fix. That is 
probably a sufficient process.

    On 6/9/21, 10:02 AM, "Mark Hanson" <hans...@vmware.com> wrote:

        I agree, I am willing to concede this discussion, as long as we are 
judicious and specifically only use it when it is not our test that is failing. 
It is a real problem. 

        Thanks,
        Mark
        On 6/9/21, 9:46 AM, "Kirk Lund" <kl...@apache.org> wrote:

            I did think about splitting up dunit tests, but I believe
            testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky 
even if I
            move it to a new dunit test. No matter how you dice it up, we end 
up with a
            PR that cannot be merged to develop unless you get lucky after 
running
            stress-new-test many times.

            One could try being tricky by marking it with @ignore or deleting 
the flaky
            test in one PR, and then re-add it in a 2nd PR. But, even then that 
2nd PR
            is very unlikely to pass stress-new-test unless someone fixes the 
cause of
            that test's flakiness.

            As it stands, stress-new-test just ends up being a dead-end or an 
endless
            time-sync for fixing multiple flaky tests in one dunit.

            On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <dasm...@vmware.com> 
wrote:

            > Would it be possible to just split that test up into multiple 
classes? It
            > sounds like the issue is that there is so many flaky tests in 
that class
            > that you can't fix them all in one PR, which might indicate it's 
too big.
            >
            > If we can't get StressNewTest to pass - that means our builds are 
failing
            > more than 2% of the time due to this one test failure. Yikes!
            >
            > -Dan
            > ________________________________
            > From: Kirk Lund <kl...@apache.org>
            > Sent: Tuesday, June 8, 2021 9:33 AM
            > To: dev@geode.apache.org <dev@geode.apache.org>
            > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement 
from PRs
            >
            > Our requirement for stress-new-test-openjdk11 to pass before 
allowing merge
            > doesn't really work as intended for fixing distributed tests that 
contain
            > multiple flaky test methods. In fact, I think it causes 
contributors to
            > avoid tackling flaky tests.
            >
            > I've been working on GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions 
FAILED
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SJNMBA%2Fdu0w1%2BFINgw8tTVJhbro%2BIYQ3rGx6dgToWkk%3D&amp;reserved=0>
            > and was able to fix it.
            >
            > However, stress-new-test-openjdk11 then continued to fail for 
other flaky
            > tests in PutAllClientServerDistributedTest. I managed to fix 
GEODE-9296 and
            > GEODE-8528 as well. I also tried but have not been able to fix 
GEODE-9242
            > which remains flaky.
            >
            > Unfortunately, I cannot merge any of my fixes for
            > PutAllClientServerDistributedTest unless every single flaky test 
in it is
            > fixed by my PR. I think this is undesirable because it would be 
better to
            > merge the fix for 3 flaky test methods than none.
            >
            > UPDATE: After running my precheckin more times, I did get
            > stress-new-test-openjdk11 to pass once so I can merge, but that's 
more of a
            > loophole than anything because I didn't manage to fix GEODE-9242.
            >
            > Despite having PR #6542 eventually pass, I would like to discuss 
removing
            > or relaxing our requirement that stress-new-test-openjdk11 must 
pass in
            > order to merge a PR...
            >
            > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
            > PutAllClientServerDistributedTest
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SNHVauGZeJaNr%2FexPEfcjZU8ci%2FJ6UKXlh8Jb8F0nqQ%3D&amp;reserved=0
            > >
            >
            > Fixed in PR #6542:
            > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
            > testPartialKeyInPRSingleHopWithRedundancy
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=yuk0ixdDUeqI%2BJgB1vmafUh2KrlYLhb1mtgzbklBXTE%3D&amp;reserved=0
            > >
            > * GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions 
FAILED
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SJNMBA%2Fdu0w1%2BFINgw8tTVJhbro%2BIYQ3rGx6dgToWkk%3D&amp;reserved=0
            > >
            > * GEODE-8528: 
PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
            > fails due to missing disk store after server restart
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805564035%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=2BrAfkArsXmTtjE1cwak41WhTVsiZjYYnVjcK6TrJS8%3D&amp;reserved=0
            > >
            >
            > Still flaky:
            > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
            > testEventIdOutOfOrderInPartitionRegionSingleHop
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805564035%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8nszzc01SnwsXR9MxNZ461LB0CYcdKlBdX6A7ihC74c%3D&amp;reserved=0
            > >
            >
            > Thanks,
            > Kirk
            >



Reply via email to