I should add that I'm only in favor of deleting the category if we adopt a
new policy that any failure means we have to fix the test and/or product
code. That holds even if you think the failure is in a test that you or your
team is not responsible for. That's no excuse to ignore a failure in your
private precheckin.
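
For the kind of rerun Dan suggests below, a simple loop around the Gradle
wrapper is probably all we need. A rough sketch, where the task name, test
filter, and run count are placeholders for whatever suite or test we want to
hammer on, not a committed script:

  # Rerun a suspect test some number of times, stopping at the first failure.
  # Task and test name are placeholders; --rerun-tasks forces re-execution.
  for i in $(seq 1 100); do
    echo "run $i"
    ./gradlew :geode-core:distributedTest --rerun-tasks \
        --tests '*TheSuspectDUnitTest' || break
  done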

On Fri, Jul 6, 2018 at 9:29 AM, Dale Emery <dem...@pivotal.io> wrote:

> The pattern I’ve seen in lots of other organizations: When a few tests
> intermittently give different answers, people attribute the intermittence
> to the tests, quickly lose trust in the entire suite, and increasingly
> discount failures.
>
> If we’re going to attend to every failure in the larger suite, then we
> won’t suffer that fate, and I’m in favor of deleting the Flaky tag.
>
> Dale
>
> > On Jul 5, 2018, at 8:15 PM, Dan Smith <dsm...@pivotal.io> wrote:
> >
> > Honestly I've never liked the flaky category. What it means is that at
> > some point in the past, we decided to put off tracking down and fixing a
> > failure and now we're left with a bug number and a description and that's it.
> >
> > I think we will be better off if we just get rid of the flaky category
> > entirely. That way no one can label anything else as flaky and push it
> > off for later, and if flaky tests fail again we will actually prioritize
> > and fix them instead of ignoring them.
> >
> > I think Patrick was looking at rerunning the flaky tests to see what is
> > still failing. How about we just run the whole flaky suite some number of
> > times (100?), fix whatever is still failing and close out and remove the
> > category from the rest?
> >
> > I think we will get more benefit from shaking out and fixing the issues
> > we have in the current codebase than we will from carefully explaining
> > the flaky failures from the past.
> >
> > -Dan
> >
> > On Thu, Jul 5, 2018 at 7:03 PM, Dale Emery <dem...@pivotal.io> wrote:
> >
> >> Hi Alexander and all,
> >>
> >>> On Jul 5, 2018, at 5:11 PM, Alexander Murmann <amurm...@pivotal.io> wrote:
> >>>
> >>> Hi everyone!
> >>>
> >>> Dan Smith started a discussion about shaking out more flaky DUnit
> >>> tests. That's a great effort and I am happy it's happening.
> >>>
> >>> As a corollary to that conversation I wonder what the criteria should
> >>> be for a test to not be considered flaky any longer and have the
> >>> category removed.
> >>>
> >>> In general the bar should be fairly high. Even if a test only fails ~1
> >>> in 500 runs, that's still a problem given how many tests we have.
> >>>
> >>> I see two ends of the spectrum:
> >>> 1. We have a good understanding of why the test was flaky and think we
> >>> fixed it.
> >>> 2. We have a hard time reproducing the flaky behavior and have no good
> >>> theory as to why the test might have shown flaky behavior.
> >>>
> >>> In the first case I'd suggest running the test ~100 times to get a
> >>> little more confidence that we fixed the flaky behavior and then
> >>> removing the category.
> >>
> >> Here’s a test for case 1:
> >>
> >> If we really understand why it was flaky, we will be able to:
> >>    - Identify the “faults”—the broken places in the code (whether system
> >> code or test code).
> >>    - Identify the exact conditions under which those faults led to the
> >> failures we observed.
> >>    - Explain how those faults, under those conditions, led to those
> >> failures.
> >>    - Run unit tests that exercise the code under those same conditions,
> >> and demonstrate that
> >>      the formerly broken code now does the right thing.
> >>
> >> If we’re lacking any of these things, I’d say we’re dealing with case 2.
> >>
> >>> The second case is a lot more problematic. How often do we want to run
> >>> a test like that before we decide that it might have been fixed since
> >>> we last saw it happen? Anything else we could/should do to verify the
> >>> test deserves our trust again?
> >>
> >>
> >> I would want a clear, compelling explanation of the failures we observed.
> >>
> >> Clear and compelling are subjective, of course. For me, clear and
> >> compelling would include descriptions of:
> >>   - The faults in the code. What, specifically, was broken.
> >>   - The specific conditions under which the code did the wrong thing.
> >>   - How those faults, under those conditions, led to those failures.
> >>   - How the fix either prevents those conditions, or causes the formerly
> >> broken code to now do the right thing.
> >>
> >> Even if we don’t have all of these elements, we may have some of them.
> >> That can help us calibrate our confidence. But the elements work
> >> together. If we’re lacking one, the others are shaky, to some extent.
> >>
> >> The more elements are missing in our explanation, the more times I’d
> >> want to run the test before trusting it.
> >>
> >> Cheers,
> >> Dale
> >>
> >> —
> >> Dale Emery
> >> dem...@pivotal.io
> >>
> >>
>
> —
> Dale Emery
> dem...@pivotal.io
>
