Check [1] to see what ‘flaky’ tests have failed recently.

Anthony
[1] https://concourse.apachegeode-ci.info/teams/main/pipelines/develop-metrics/jobs/GeodeFlakyTestMetrics/builds/51

> On Jul 6, 2018, at 6:56 AM, Jinmei Liao <jil...@pivotal.io> wrote:
>
> +1 for removing flaky category and fix as failure occurs.
>
> On Thu, Jul 5, 2018 at 8:21 PM Dan Smith <dsm...@pivotal.io> wrote:
>
>> Honestly I've never liked the flaky category. What it means is that at some
>> point in the past, we decided to put off tracking down and fixing a failure,
>> and now we're left with a bug number and a description and that's it.
>>
>> I think we will be better off if we just get rid of the flaky category
>> entirely. That way no one can label anything else as flaky and push it off
>> for later, and if flaky tests fail again we will actually prioritize and
>> fix them instead of ignoring them.
>>
>> I think Patrick was looking at rerunning the flaky tests to see what is
>> still failing. How about we just run the whole flaky suite some number of
>> times (100?), fix whatever is still failing, and close out and remove the
>> category from the rest?
>>
>> I think we will get more benefit from shaking out and fixing the issues we
>> have in the current codebase than we will from carefully explaining the
>> flaky failures from the past.
>>
>> -Dan
>>
>> On Thu, Jul 5, 2018 at 7:03 PM, Dale Emery <dem...@pivotal.io> wrote:
>>
>>> Hi Alexander and all,
>>>
>>>> On Jul 5, 2018, at 5:11 PM, Alexander Murmann <amurm...@pivotal.io> wrote:
>>>>
>>>> Hi everyone!
>>>>
>>>> Dan Smith started a discussion about shaking out more flaky DUnit tests.
>>>> That's a great effort and I am happy it's happening.
>>>>
>>>> As a corollary to that conversation, I wonder what the criteria should be
>>>> for a test to no longer be considered flaky and have the category removed.
>>>>
>>>> In general the bar should be fairly high. Even if a test only fails ~1 in
>>>> 500 runs, that's still a problem given how many tests we have.
>>>>
>>>> I see two ends of the spectrum:
>>>> 1. We have a good understanding of why the test was flaky and think we
>>>> fixed it.
>>>> 2. We have a hard time reproducing the flaky behavior and have no good
>>>> theory as to why the test might have shown flaky behavior.
>>>>
>>>> In the first case I'd suggest running the test ~100 times to get a little
>>>> more confidence that we fixed the flaky behavior, and then removing the
>>>> category.
>>>
>>> Here’s a test for case 1:
>>>
>>> If we really understand why it was flaky, we will be able to:
>>> - Identify the “faults”: the broken places in the code (whether system
>>>   code or test code).
>>> - Identify the exact conditions under which those faults led to the
>>>   failures we observed.
>>> - Explain how those faults, under those conditions, led to those failures.
>>> - Run unit tests that exercise the code under those same conditions, and
>>>   demonstrate that the formerly broken code now does the right thing.
>>>
>>> If we’re lacking any of these things, I’d say we’re dealing with case 2.
>>>
>>>> The second case is a lot more problematic. How often do we want to run a
>>>> test like that before we decide that it might have been fixed since we
>>>> last saw it happen? Anything else we could/should do to verify the test
>>>> deserves our trust again?
>>>
>>> I would want a clear, compelling explanation of the failures we observed.
>>>
>>> Clear and compelling are subjective, of course. For me, clear and
>>> compelling would include descriptions of:
>>> - The faults in the code. What, specifically, was broken.
>>> - The specific conditions under which the code did the wrong thing.
>>> - How those faults, under those conditions, led to those failures.
>>> - How the fix either prevents those conditions, or causes the formerly
>>>   broken code to now do the right thing.
>>>
>>> Even if we don’t have all of these elements, we may have some of them.
>>> That can help us calibrate our confidence. But the elements work together.
>>> If we’re lacking one, the others are shaky, to some extent.
>>>
>>> The more elements are missing in our explanation, the more times I’d want
>>> to run the test before trusting it.
>>>
>>> Cheers,
>>> Dale
>>>
>>> —
>>> Dale Emery
>>> dem...@pivotal.io
>>>
>>
>
>
> --
> Cheers
>
> Jinmei
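
[Editor's note] The thread leaves the mechanics of "run the test ~100 times" open. As one possible illustration only (not how Geode's build is actually configured), here is a minimal sketch using JUnit 5's @RepeatedTest, assuming JUnit 5 is on the test classpath; the class and method names are hypothetical.

    import static org.junit.jupiter.api.Assertions.assertTrue;

    import org.junit.jupiter.api.RepeatedTest;

    // Hypothetical sketch: repeat a formerly flaky scenario many times in one
    // run to build confidence before removing its Flaky category.
    class FormerlyFlakyScenarioTest {

      @RepeatedTest(100) // JUnit 5 invokes this test method 100 times
      void scenarioThatUsedToFailIntermittently() {
        // Exercise the code path that used to fail intermittently, then
        // assert on the expected outcome. Placeholder assertion shown here.
        assertTrue(true);
      }
    }

Running such a repeated test locally, or in a dedicated CI job, before removing the category is one way to get the extra confidence Alexander describes; a project on JUnit 4 would need a loop or a repeat/retry rule instead.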