Thanks Kirk for tackling some of our flakiest tests!  I agree, we don't want to 
discourage anyone interested in paying down tech debt.

The Geode community has spoken clearly against bypassing or weakening required 
PR checks, so relaxing requirements in general might be a tough sell, but I'm 
curious what you had in mind.

It might be easier to try to get consensus on the dev list to approve a 
one-time exception, when a solid argument can be made that all reasonable 
alternatives have been exhausted (e.g. splitting up the tests or PR, 
re-triggering several times, running with increased timeout, etc) and the risk 
is low (e.g. test-only fixes or reverts).

On 6/8/21, 9:33 AM, "Kirk Lund" <kl...@apache.org> wrote:

    Our requirement for stress-new-test-openjdk11 to pass before allowing merge
    doesn't really work as intended for fixing distributed tests that contain
    multiple flaky test methods. In fact, I think it causes contributors to
    avoid tackling flaky tests.

    I've been working on GEODE-9103: CI Failure:
    PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=wxvXQmgGu%2FVL%2Bn84QzumhDCVJwVHIeXfH9CnuJTOoS0%3D&amp;reserved=0>
 and was able to fix it.

    However, stress-new-test-openjdk11 then continued to fail for other flaky
    tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
    GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
    which remains flaky.

    Unfortunately, I cannot merge any of my fixes for
    PutAllClientServerDistributedTest unless every single flaky test in it is
    fixed by my PR. I think this is undesirable because it would be better to
    merge the fix for 3 flaky test methods than none.

    UPDATE: After running my precheckin more times, I did get
    stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
    loophole than anything because I didn't manage to fix GEODE-9242.

    Despite having PR #6542 eventually pass, I would like to discuss removing
    or relaxing our requirement that stress-new-test-openjdk11 must pass in
    order to merge a PR...

    PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
    PutAllClientServerDistributedTest
    
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=PnThYY0f8z7kX3nGIT9x4gqlIU%2B3NIRMlh%2B3XNphw5c%3D&amp;reserved=0>

    Fixed in PR #6542:
    * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
    testPartialKeyInPRSingleHopWithRedundancy
    
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=7bFHlYbkJ%2B1NIVnasz5%2B9ykmX8f766S13s6or82tPNs%3D&amp;reserved=0>
    * GEODE-9103: CI Failure:
    PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=wxvXQmgGu%2FVL%2Bn84QzumhDCVJwVHIeXfH9CnuJTOoS0%3D&amp;reserved=0>
    * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
    fails due to missing disk store after server restart
    
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=31oBR7OYsFvJqSbO7mTFfGC7pA9fptCqBOcN8AoA9a0%3D&amp;reserved=0>

    Still flaky:
    * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
    testEventIdOutOfOrderInPartitionRegionSingleHop
    
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=GovuvljP%2FX%2BeR6GohGIhfwFBeAbfwUdjqeHtMKFOzes%3D&amp;reserved=0>

    Thanks,
    Kirk

Reply via email to