I should add that I'm only in favor of deleting the category if we also adopt a new policy: any failure means we fix the test and/or the product code. Even if you think the failure is in a test that you or your team is not responsible for, that's no excuse to ignore a failure in your private precheckin.
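On the point below about re-running a suspect test ~100 times before removing the category, here is a rough sketch of one way to do that in a single run, assuming JUnit 5 (junit-jupiter) is available on the test classpath; the class and method names are made-up placeholders, not real Geode tests:

import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.RepeatedTest;

class FormerlyFlakyRepeatTest {

  // Runs the same scenario 100 times within one test invocation.
  @RepeatedTest(100)
  void formerlyFlakyScenarioStaysGreen() {
    // Placeholder for the scenario that used to fail intermittently; the real
    // body would exercise the code under the conditions identified in the
    // failure analysis.
    boolean result = doTheFormerlyFlakyOperation();
    assertTrue(result);
  }

  private boolean doTheFormerlyFlakyOperation() {
    return true; // stand-in for the formerly flaky operation
  }
}

The same effect could come from looping the test task in a shell script, but an in-code repeat keeps all the evidence in one test report.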
On Fri, Jul 6, 2018 at 9:29 AM, Dale Emery <dem...@pivotal.io> wrote:

> The pattern I’ve seen in lots of other organizations: When a few tests
> intermittently give different answers, people attribute the intermittence
> to the tests, quickly lose trust in the entire suite, and increasingly
> discount failures.
>
> If we’re going to attend to every failure in the larger suite, then we
> won’t suffer that fate, and I’m in favor of deleting the Flaky tag.
>
> Dale
>
> > On Jul 5, 2018, at 8:15 PM, Dan Smith <dsm...@pivotal.io> wrote:
> >
> > Honestly I've never liked the flaky category. What it means is that at
> > some point in the past, we decided to put off tracking down and fixing
> > a failure, and now we're left with a bug number and a description and
> > that's it.
> >
> > I think we will be better off if we just get rid of the flaky category
> > entirely. That way no one can label anything else as flaky and push it
> > off for later, and if flaky tests fail again we will actually prioritize
> > and fix them instead of ignoring them.
> >
> > I think Patrick was looking at rerunning the flaky tests to see what is
> > still failing. How about we just run the whole flaky suite some number
> > of times (100?), fix whatever is still failing, and close out and remove
> > the category from the rest?
> >
> > I think we will get more benefit from shaking out and fixing the issues
> > we have in the current codebase than we will from carefully explaining
> > the flaky failures from the past.
> >
> > -Dan
> >
> > On Thu, Jul 5, 2018 at 7:03 PM, Dale Emery <dem...@pivotal.io> wrote:
> >
> >> Hi Alexander and all,
> >>
> >>> On Jul 5, 2018, at 5:11 PM, Alexander Murmann <amurm...@pivotal.io>
> >>> wrote:
> >>>
> >>> Hi everyone!
> >>>
> >>> Dan Smith started a discussion about shaking out more flaky DUnit
> >>> tests. That's a great effort and I am happy it's happening.
> >>>
> >>> As a corollary to that conversation, I wonder what the criteria should
> >>> be for a test to no longer be considered flaky and have the category
> >>> removed.
> >>>
> >>> In general the bar should be fairly high. Even if a test only fails
> >>> ~1 in 500 runs, that's still a problem given how many tests we have.
> >>>
> >>> I see two ends of the spectrum:
> >>> 1. We have a good understanding of why the test was flaky and think
> >>>    we fixed it.
> >>> 2. We have a hard time reproducing the flaky behavior and have no good
> >>>    theory as to why the test might have shown flaky behavior.
> >>>
> >>> In the first case I'd suggest running the test ~100 times to get a
> >>> little more confidence that we fixed the flaky behavior, and then
> >>> removing the category.
> >>
> >> Here’s a test for case 1:
> >>
> >> If we really understand why it was flaky, we will be able to:
> >> - Identify the “faults”: the broken places in the code (whether system
> >>   code or test code).
> >> - Identify the exact conditions under which those faults led to the
> >>   failures we observed.
> >> - Explain how those faults, under those conditions, led to those
> >>   failures.
> >> - Run unit tests that exercise the code under those same conditions,
> >>   and demonstrate that the formerly broken code now does the right
> >>   thing.
> >>
> >> If we’re lacking any of these things, I’d say we’re dealing with case 2.
> >>
> >>> The second case is a lot more problematic. How often do we want to run
> >>> a test like that before we decide that it might have been fixed since
> >>> we last saw it happen? Anything else we could/should do to verify the
> >>> test deserves our trust again?
> >>
> >> I would want a clear, compelling explanation of the failures we
> >> observed.
> >>
> >> Clear and compelling are subjective, of course. For me, clear and
> >> compelling would include descriptions of:
> >> - The faults in the code. What, specifically, was broken.
> >> - The specific conditions under which the code did the wrong thing.
> >> - How those faults, under those conditions, led to those failures.
> >> - How the fix either prevents those conditions, or causes the formerly
> >>   broken code to now do the right thing.
> >>
> >> Even if we don’t have all of these elements, we may have some of them.
> >> That can help us calibrate our confidence. But the elements work
> >> together. If we’re lacking one, the others are shaky, to some extent.
> >>
> >> The more elements are missing in our explanation, the more times I’d
> >> want to run the test before trusting it.
> >>
> >> Cheers,
> >> Dale
> >>
> >> —
> >> Dale Emery
> >> dem...@pivotal.io
>
> —
> Dale Emery
> dem...@pivotal.io