+1 -- go ahead and delete the FlakyTest category, but please NEVER go back to
marking flaky tests with @Ignore or renaming JUnit 3 test methods -- in my
opinion that was infinitely worse. The category was meant to be a
quarantine for FIXING, not for IGNORING.
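
For anyone less familiar with how the quarantine worked, a minimal sketch is
below. The test class and method names are made up for illustration, and I'm
assuming the category class at org.apache.geode.test.junit.categories.FlakyTest:

    import org.apache.geode.test.junit.categories.FlakyTest;
    import org.junit.Ignore;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    public class ExampleDUnitTest {  // hypothetical test class

      // Quarantine: the test keeps running (in the separate flaky target),
      // so the failure stays visible until someone actually fixes it.
      @Category(FlakyTest.class)
      @Test
      public void quarantinedScenario() {
        // ...
      }

      // What we should never go back to: the test stops running entirely and
      // the underlying problem is quietly forgotten. (The JUnit 3 equivalent
      // was renaming a testFoo() method so the runner no longer picked it up.)
      @Ignore("fails intermittently")
      @Test
      public void silencedScenario() {
        // ...
      }
    }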

On Fri, Jul 6, 2018 at 7:26 AM, Anthony Baker <aba...@pivotal.io> wrote:

> Check [1] to see what ‘flaky’ tests have failed recently.
>
> Anthony
>
> [1] https://concourse.apachegeode-ci.info/teams/main/pipelines/develop-metrics/jobs/GeodeFlakyTestMetrics/builds/51
>
>
> > On Jul 6, 2018, at 6:56 AM, Jinmei Liao <jil...@pivotal.io> wrote:
> >
> > +1 for removing the flaky category and fixing tests as failures occur.
> >
> > On Thu, Jul 5, 2018 at 8:21 PM Dan Smith <dsm...@pivotal.io> wrote:
> >
> >> Honestly I've never liked the flaky category. What it means is that at
> >> some point in the past, we decided to put off tracking down and fixing
> >> a failure, and now we're left with a bug number and a description and
> >> that's it.
> >>
> >> I think we will be better off if we just get rid of the flaky category
> >> entirely. That way no one can label anything else as flaky and push it
> >> off for later, and if flaky tests fail again we will actually prioritize
> >> and fix them instead of ignoring them.
> >>
> >> I think Patrick was looking at rerunning the flaky tests to see what is
> >> still failing. How about we just run the whole flaky suite some number
> >> of times (100?), fix whatever is still failing and close out and remove
> >> the category from the rest?
> >>
> >> I think we will get more benefit from shaking out and fixing the issues
> >> we have in the current codebase than we will from carefully explaining
> >> the flaky failures from the past.
> >>
> >> -Dan
> >>
> >> On Thu, Jul 5, 2018 at 7:03 PM, Dale Emery <dem...@pivotal.io> wrote:
> >>
> >>> Hi Alexander and all,
> >>>
> >>>> On Jul 5, 2018, at 5:11 PM, Alexander Murmann <amurm...@pivotal.io> wrote:
> >>>>
> >>>> Hi everyone!
> >>>>
> >>>> Dan Smith started a discussion about shaking out more flaky DUnit
> >>>> tests. That's a great effort and I am happy it's happening.
> >>>>
> >>>> As a corollary to that conversation I wonder what the criteria
> >>>> should be for a test to not be considered flaky any longer and have
> >>>> the category removed.
> >>>>
> >>>> In general the bar should be fairly high. Even if a test only fails
> >>>> ~1 in 500 runs that's still a problem given how many tests we have.
> >>>>
> >>>> I see two ends of the spectrum:
> >>>> 1. We have a good understanding why the test was flaky and think we
> >>>> fixed it.
> >>>> 2. We have a hard time reproducing the flaky behavior and have no
> >>>> good theory as to why the test might have shown flaky behavior.
> >>>>
> >>>> In the first case I'd suggest to run the test ~100 times to get a
> >>>> little more confidence that we fixed the flaky behavior and then
> >>>> remove the category.
> >>>
> >>> Here’s a test for case 1:
> >>>
> >>> If we really understand why it was flaky, we will be able to:
> >>>    - Identify the “faults”—the broken places in the code (whether
> >>>      system code or test code).
> >>>    - Identify the exact conditions under which those faults led to
> >>>      the failures we observed.
> >>>    - Explain how those faults, under those conditions, led to those
> >>>      failures.
> >>>    - Run unit tests that exercise the code under those same
> >>>      conditions, and demonstrate that the formerly broken code now
> >>>      does the right thing.
> >>>
> >>> If we’re lacking any of these things, I’d say we’re dealing with
> >>> case 2.
> >>>
> >>>> The second case is a lot more problematic. How often do we want to
> >>>> run a test like that before we decide that it might have been fixed
> >>>> since we last saw it happen? Anything else we could/should do to
> >>>> verify the test deserves our trust again?
> >>>
> >>>
> >>> I would want a clear, compelling explanation of the failures we
> >>> observed.
> >>>
> >>> Clear and compelling are subjective, of course. For me, clear and
> >>> compelling would include descriptions of:
> >>>   - The faults in the code. What, specifically, was broken.
> >>>   - The specific conditions under which the code did the wrong thing.
> >>>   - How those faults, under those conditions, led to those failures.
> >>>   - How the fix either prevents those conditions, or causes the
> >>>     formerly broken code to now do the right thing.
> >>>
> >>> Even if we don’t have all of these elements, we may have some of
> >>> them. That can help us calibrate our confidence. But the elements
> >>> work together. If we’re lacking one, the others are shaky, to some
> >>> extent.
> >>>
> >>> The more elements are missing in our explanation, the more times I’d
> >>> want to run the test before trusting it.
> >>>
> >>> Cheers,
> >>> Dale
> >>>
> >>> —
> >>> Dale Emery
> >>> dem...@pivotal.io
> >>>
> >>>
> >>
> >
> >
> > --
> > Cheers
> >
> > Jinmei
>
>
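
For the "run it ~100 times" step that Dan and Alexander both suggest, one way
to do it from a single test class is sketched below. This is just an
illustration, not something that exists in the build today; the Repeat rule
and the test name are made up:

    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.TestRule;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;

    public class FormerlyFlakyRegressionTest {

      // Runs each test body N times; the first failure fails the test.
      static class Repeat implements TestRule {
        private final int times;

        Repeat(int times) {
          this.times = times;
        }

        @Override
        public Statement apply(Statement base, Description description) {
          return new Statement() {
            @Override
            public void evaluate() throws Throwable {
              for (int i = 0; i < times; i++) {
                base.evaluate();
              }
            }
          };
        }
      }

      @Rule
      public Repeat repeat = new Repeat(100);

      @Test
      public void scenarioWeBelieveWeFixed() {
        // ... the formerly flaky scenario ...
      }
    }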
