I think the basic problem is that we have too much tech debt in the form of 
dunit tests. We are proposing all of these "workarounds" to avoid dealing with 
the core problem.

On 6/8/21, 12:09 PM, "Dan Smith" <dasm...@vmware.com> wrote:

    Would it be possible to just split that test up into multiple classes? It 
sounds like the issue is that there are so many flaky tests in that class that 
you can't fix them all in one PR, which might indicate it's too big.

    If we can't get StressNewTest to pass - that means our builds are failing 
more than 2% of the time due to this one test failure. Yikes!

    -Dan
    ________________________________
    From: Kirk Lund <kl...@apache.org>
    Sent: Tuesday, June 8, 2021 9:33 AM
    To: dev@geode.apache.org <dev@geode.apache.org>
    Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

    Our requirement for stress-new-test-openjdk11 to pass before allowing merge
    doesn't really work as intended for fixing distributed tests that contain
    multiple flaky test methods. In fact, I think it causes contributors to
    avoid tackling flaky tests.

    I've been working on GEODE-9103: CI Failure:
    PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    
    <https://issues.apache.org/jira/browse/GEODE-9103> and was able to fix it.

    However, stress-new-test-openjdk11 then continued to fail for other flaky
    tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
    GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
    which remains flaky.

    Unfortunately, I cannot merge any of my fixes for
    PutAllClientServerDistributedTest unless every single flaky test in it is
    fixed by my PR. I think this is undesirable because it would be better to
    merge the fix for 3 flaky test methods than none.

    UPDATE: After running my precheckin more times, I did get
    stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
    loophole than anything because I didn't manage to fix GEODE-9242.

    Despite having PR #6542 eventually pass, I would like to discuss removing
    or relaxing our requirement that stress-new-test-openjdk11 must pass in
    order to merge a PR...

    PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
    PutAllClientServerDistributedTest
    
    <https://github.com/apache/geode/pull/6542>

    Fixed in PR #6542:
    * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
    testPartialKeyInPRSingleHopWithRedundancy
    
    <https://issues.apache.org/jira/browse/GEODE-9296>
    * GEODE-9103: CI Failure:
    PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    
    <https://issues.apache.org/jira/browse/GEODE-9103>
    * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
    fails due to missing disk store after server restart
    
    <https://issues.apache.org/jira/browse/GEODE-8528>

    Still flaky:
    * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
    testEventIdOutOfOrderInPartitionRegionSingleHop
    
    <https://issues.apache.org/jira/browse/GEODE-9242>

    Thanks,
    Kirk
