One other thing is that we have code owners. If the PR submitter decides to 
ignore the stress new test, the code owner can still request a fix. That is 
probably a sufficient process.

On 6/9/21, 10:02 AM, "Mark Hanson" <hans...@vmware.com> wrote:

    I agree, I am willing to concede this discussion, as long as we are 
judicious and specifically only use it when it is not our test that is failing. 
It is a real problem. 

    Thanks,
    Mark
    On 6/9/21, 9:46 AM, "Kirk Lund" <kl...@apache.org> wrote:

        I did think about splitting up dunit tests, but I believe
        testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky even 
if I
        move it to a new dunit test. No matter how you dice it up, we end up 
with a
        PR that cannot be merged to develop unless you get lucky after running
        stress-new-test many times.

        One could try being tricky by marking it with @ignore or deleting the 
flaky
        test in one PR, and then re-add it in a 2nd PR. But, even then that 2nd 
PR
        is very unlikely to pass stress-new-test unless someone fixes the cause 
of
        that test's flakiness.

        As it stands, stress-new-test just ends up being a dead-end or an 
endless
        time-sync for fixing multiple flaky tests in one dunit.

        On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <dasm...@vmware.com> wrote:

        > Would it be possible to just split that test up into multiple 
classes? It
        > sounds like the issue is that there is so many flaky tests in that 
class
        > that you can't fix them all in one PR, which might indicate it's too 
big.
        >
        > If we can't get StressNewTest to pass - that means our builds are 
failing
        > more than 2% of the time due to this one test failure. Yikes!
        >
        > -Dan
        > ________________________________
        > From: Kirk Lund <kl...@apache.org>
        > Sent: Tuesday, June 8, 2021 9:33 AM
        > To: dev@geode.apache.org <dev@geode.apache.org>
        > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from 
PRs
        >
        > Our requirement for stress-new-test-openjdk11 to pass before allowing 
merge
        > doesn't really work as intended for fixing distributed tests that 
contain
        > multiple flaky test methods. In fact, I think it causes contributors 
to
        > avoid tackling flaky tests.
        >
        > I've been working on GEODE-9103: CI Failure:
        > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
        > <
        > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393164021%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=HJd0ZQcLPldhtt5%2Bw%2FxJSDcTOUxd658Um45esI33BNQ%3D&amp;reserved=0>
        > and was able to fix it.
        >
        > However, stress-new-test-openjdk11 then continued to fail for other 
flaky
        > tests in PutAllClientServerDistributedTest. I managed to fix 
GEODE-9296 and
        > GEODE-8528 as well. I also tried but have not been able to fix 
GEODE-9242
        > which remains flaky.
        >
        > Unfortunately, I cannot merge any of my fixes for
        > PutAllClientServerDistributedTest unless every single flaky test in 
it is
        > fixed by my PR. I think this is undesirable because it would be 
better to
        > merge the fix for 3 flaky test methods than none.
        >
        > UPDATE: After running my precheckin more times, I did get
        > stress-new-test-openjdk11 to pass once so I can merge, but that's 
more of a
        > loophole than anything because I didn't manage to fix GEODE-9242.
        >
        > Despite having PR #6542 eventually pass, I would like to discuss 
removing
        > or relaxing our requirement that stress-new-test-openjdk11 must pass 
in
        > order to merge a PR...
        >
        > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
        > PutAllClientServerDistributedTest
        > <
        > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393164021%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=l8OHw3XqetJVrK3Ltp9ZjSUfxvRd5y1087m62eemskA%3D&amp;reserved=0
        > >
        >
        > Fixed in PR #6542:
        > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
        > testPartialKeyInPRSingleHopWithRedundancy
        > <
        > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393164021%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SbcGFpxr9KJwnpFflJyP5nKLBqzPE%2BN2bnTTyHFjl7Q%3D&amp;reserved=0
        > >
        > * GEODE-9103: CI Failure:
        > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
        > <
        > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393173973%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=4XtlblZ%2FhLCGRZunt1o%2BpwRhtnQZ%2BFLweZ5%2B%2FdRR%2BlA%3D&amp;reserved=0
        > >
        > * GEODE-8528: 
PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
        > fails due to missing disk store after server restart
        > <
        > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393173973%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=LYOOFXczCGemBLV%2FdREukaexvV93%2FmnlUzZamdsWzqI%3D&amp;reserved=0
        > >
        >
        > Still flaky:
        > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
        > testEventIdOutOfOrderInPartitionRegionSingleHop
        > <
        > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393173973%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=voQKjCBsArlWaAvLVSpwRSLdp0yqcOjo6NAGBF9sP%2B4%3D&amp;reserved=0
        > >
        >
        > Thanks,
        > Kirk
        >


Reply via email to