One other thing to think about is perhaps having a rotating team to deal with flaky tests, a small team commissioned every three or 6 months to clear out flaky tests for 1 month. It is good experience
Thanks, Mark On 6/9/21, 10:04 AM, "Mark Hanson" <hans...@vmware.com> wrote: One other thing is that we have code owners. If the PR submitter decides to ignore the stress new test, the code owner can still request a fix. That is probably a sufficient process. On 6/9/21, 10:02 AM, "Mark Hanson" <hans...@vmware.com> wrote: I agree, I am willing to concede this discussion, as long as we are judicious and specifically only use it when it is not our test that is failing. It is a real problem. Thanks, Mark On 6/9/21, 9:46 AM, "Kirk Lund" <kl...@apache.org> wrote: I did think about splitting up dunit tests, but I believe testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky even if I move it to a new dunit test. No matter how you dice it up, we end up with a PR that cannot be merged to develop unless you get lucky after running stress-new-test many times. One could try being tricky by marking it with @ignore or deleting the flaky test in one PR, and then re-add it in a 2nd PR. But, even then that 2nd PR is very unlikely to pass stress-new-test unless someone fixes the cause of that test's flakiness. As it stands, stress-new-test just ends up being a dead-end or an endless time-sync for fixing multiple flaky tests in one dunit. On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <dasm...@vmware.com> wrote: > Would it be possible to just split that test up into multiple classes? It > sounds like the issue is that there is so many flaky tests in that class > that you can't fix them all in one PR, which might indicate it's too big. > > If we can't get StressNewTest to pass - that means our builds are failing > more than 2% of the time due to this one test failure. Yikes! > > -Dan > ________________________________ > From: Kirk Lund <kl...@apache.org> > Sent: Tuesday, June 8, 2021 9:33 AM > To: dev@geode.apache.org <dev@geode.apache.org> > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs > > Our requirement for stress-new-test-openjdk11 to pass before allowing merge > doesn't really work as intended for fixing distributed tests that contain > multiple flaky test methods. In fact, I think it causes contributors to > avoid tackling flaky tests. > > I've been working on GEODE-9103: CI Failure: > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=SJNMBA%2Fdu0w1%2BFINgw8tTVJhbro%2BIYQ3rGx6dgToWkk%3D&reserved=0> > and was able to fix it. > > However, stress-new-test-openjdk11 then continued to fail for other flaky > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242 > which remains flaky. > > Unfortunately, I cannot merge any of my fixes for > PutAllClientServerDistributedTest unless every single flaky test in it is > fixed by my PR. I think this is undesirable because it would be better to > merge the fix for 3 flaky test methods than none. > > UPDATE: After running my precheckin more times, I did get > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a > loophole than anything because I didn't manage to fix GEODE-9242. > > Despite having PR #6542 eventually pass, I would like to discuss removing > or relaxing our requirement that stress-new-test-openjdk11 must pass in > order to merge a PR... > > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in > PutAllClientServerDistributedTest > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=SNHVauGZeJaNr%2FexPEfcjZU8ci%2FJ6UKXlh8Jb8F0nqQ%3D&reserved=0 > > > > Fixed in PR #6542: > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest > > testPartialKeyInPRSingleHopWithRedundancy > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=yuk0ixdDUeqI%2BJgB1vmafUh2KrlYLhb1mtgzbklBXTE%3D&reserved=0 > > > * GEODE-9103: CI Failure: > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=SJNMBA%2Fdu0w1%2BFINgw8tTVJhbro%2BIYQ3rGx6dgToWkk%3D&reserved=0 > > > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop > fails due to missing disk store after server restart > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805564035%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=2BrAfkArsXmTtjE1cwak41WhTVsiZjYYnVjcK6TrJS8%3D&reserved=0 > > > > Still flaky: > * GEODE-9242: CI failure in PutAllClientServerDistributedTest > > testEventIdOutOfOrderInPartitionRegionSingleHop > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805564035%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=8nszzc01SnwsXR9MxNZ461LB0CYcdKlBdX6A7ihC74c%3D&reserved=0 > > > > Thanks, > Kirk >