Check [1] to see what ‘flaky’ tests have failed recently.

Anthony

[1] 
https://concourse.apachegeode-ci.info/teams/main/pipelines/develop-metrics/jobs/GeodeFlakyTestMetrics/builds/51


> On Jul 6, 2018, at 6:56 AM, Jinmei Liao <jil...@pivotal.io> wrote:
> 
> +1 for removing flaky category and fix as failure occurs.
> 
> On Thu, Jul 5, 2018 at 8:21 PM Dan Smith <dsm...@pivotal.io> wrote:
> 
>> Honestly I've never liked the flaky category. What it means is that at some
>> point in the past, we decided to put off tracking down and fixing a failure
>> and now we're left with a bug number and a description and that's it.
>> 
>> I think we will be better off if we just get rid of the flaky category
>> entirely. That way no one can label anything else as flaky and push it off
>> for later, and if flaky tests fail again we will actually prioritize and
>> fix them instead of ignoring them.
>> 
>> I think Patrick was looking at rerunning the flaky tests to see what is
>> still failing. How about we just run the whole flaky suite some number of
>> times (100?), fix whatever is still failing and close out and remove the
>> category from the rest?
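
A rough statistical footnote on the "run it 100 times" idea: by the standard
rule of three, if a test passes n independent runs with zero failures, a 95%
upper confidence bound on its per-run failure rate is about 3/n. So 100 clean
runs only shows the rate is probably below ~3 in 100; bounding it below the
1-in-500 level mentioned further down in the thread would take on the order of
1,500 clean runs.
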
>> 
>> I think we will get more benefit from shaking out and fixing the issues we
>> have in the current codebase than we will from carefully explaining the
>> flaky failures from the past.
>> 
>> -Dan
>> 
>> On Thu, Jul 5, 2018 at 7:03 PM, Dale Emery <dem...@pivotal.io> wrote:
>> 
>>> Hi Alexander and all,
>>> 
>>>> On Jul 5, 2018, at 5:11 PM, Alexander Murmann <amurm...@pivotal.io> wrote:
>>>> 
>>>> Hi everyone!
>>>> 
>>>> Dan Smith started a discussion about shaking out more flaky DUnit tests.
>>>> That's a great effort and I am happy it's happening.
>>>> 
>>>> As a corollary to that conversation I wonder what the criteria should be
>>>> for a test to not be considered flaky any longer and have the category
>>>> removed.
>>>> 
>>>> In general the bar should be fairly high. Even if a test only fails ~1 in
>>>> 500 runs that's still a problem given how many tests we have.
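
To put a rough number on that: if, say, 200 tests each fail independently
about 1 run in 500 (the 200 is purely an illustrative assumption), the chance
that a full run comes back green from those tests alone is (499/500)^200 ≈
0.67, so roughly one build in three goes red with nothing newly broken.
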
>>>> 
>>>> I see two ends of the spectrum:
>>>> 1. We have a good understanding why the test was flaky and think we fixed
>>>> it.
>>>> 2. We have a hard time reproducing the flaky behavior and have no good
>>>> theory as to why the test might have shown flaky behavior.
>>>> 
>>>> In the first case I'd suggest to run the test ~100 times to get a little
>>>> more confidence that we fixed the flaky behavior and then remove the
>>>> category.
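
For the ~100 local repeats, a minimal sketch in plain JUnit 4 is below; the
rule and class names are hypothetical, not an existing Geode utility, and
geode-junit may already ship something equivalent that would be preferable:

    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.TestRule;
    import org.junit.runners.model.Statement;

    public class FormerlyFlakyRegressionTest {

      private static final int REPEAT_COUNT = 100;

      // Repeats each test body REPEAT_COUNT times; the first failure fails the run.
      @Rule
      public TestRule repeat = (base, description) -> new Statement() {
        @Override
        public void evaluate() throws Throwable {
          for (int i = 0; i < REPEAT_COUNT; i++) {
            base.evaluate();
          }
        }
      };

      @Test
      public void formerlyFlakyScenario() {
        // ... the scenario that used to fail intermittently ...
      }
    }
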
>>> 
>>> Here’s a test for case 1:
>>> 
>>> If we really understand why it was flaky, we will be able to:
>>>    - Identify the “faults”—the broken places in the code (whether system
>>> code or test code).
>>>    - Identify the exact conditions under which those faults led to the
>>> failures we observed.
>>>    - Explain how those faults, under those conditions, led to those
>>> failures.
>>>    - Run unit tests that exercise the code under those same conditions,
>>> and demonstrate that
>>>      the formerly broken code now does the right thing.
>>> 
>>> If we’re lacking any of these things, I’d say we’re dealing with case 2.
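
On the unit-test point above, the usual trick is to make the once-rare
interleaving deterministic instead of waiting for the scheduler to produce it.
A minimal sketch, assuming JUnit 4 plus AssertJ, with hypothetical placeholders
standing in for the real operations:

    import static org.assertj.core.api.Assertions.assertThat;

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import org.junit.Test;

    public class FormerlyFlakyOrderingTest {

      @Test
      public void survivesTheOrderingThatUsedToFail() throws Exception {
        // A latch pins the interleaving that used to occur only occasionally.
        CountDownLatch racingStepDone = new CountDownLatch(1);

        Thread other = new Thread(() -> {
          // ... the operation that used to win the race ...
          racingStepDone.countDown();
        });
        other.start();

        // Wait until the racing operation has definitely happened first, then
        // run the step that used to observe the broken state in that window.
        assertThat(racingStepDone.await(30, TimeUnit.SECONDS)).isTrue();
        // ... the operation that used to fail under this ordering ...
        other.join(TimeUnit.SECONDS.toMillis(30));

        // ... assert that the formerly broken code now does the right thing ...
      }
    }

The value is that if the fix ever regresses, a test like this fails every
time rather than once in hundreds of runs.
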
>>> 
>>>> The second case is a lot more problematic. How often do we want to run a
>>>> test like that before we decide that it might have been fixed since we last
>>>> saw it happen? Anything else we could/should do to verify the test deserves
>>>> our trust again?
>>> 
>>> 
>>> I would want a clear, compelling explanation of the failures we observed.
>>> 
>>> Clear and compelling are subjective, of course. For me, clear and
>>> compelling would include
>>> descriptions of:
>>>   - The faults in the code. What, specifically, was broken.
>>>   - The specific conditions under which the code did the wrong thing.
>>>   - How those faults, under those conditions, led to those failures.
>>>   - How the fix either prevents those conditions, or causes the formerly
>>> broken code to
>>>     now do the right thing.
>>> 
>>> Even if we don’t have all of these elements, we may have some of them.
>>> That can help us calibrate our confidence. But the elements work together.
>>> If we’re lacking one, the others are shaky, to some extent.
>>> 
>>> The more elements are missing in our explanation, the more times I’d want
>>> to run the test
>>> before trusting it.
>>> 
>>> Cheers,
>>> Dale
>>> 
>>> —
>>> Dale Emery
>>> dem...@pivotal.io
>>> 
>>> 
>> 
> 
> 
> -- 
> Cheers
> 
> Jinmei
