For a test failure that has no ticket matching that particular stack
trace, you should re-trigger your pre-checkin. If the test fails again,
your change probably caused the failure, even if it doesn't seem
related (like the unit test ordering issue we had the Friday before last),
and you would be expected to fix it before committing.
If the test passes on your next run, the process is less established.
Ideally we would never encounter this case, because it means flaky code was
checked in at some point before you branched off of develop, which
StressNewTest should catch. We will probably find a few of these because
the StressNewTest job was silently failing for a little while. It is
working again, but we may have checked in some flaky tests during the time
it was down.

I propose that we follow the CIO process with these. The process would be
as follows:
1. Create a Jira ticket for the failure with links to the relevant
resources from the failing CI run. Include any evidence you have that the
test is already flaky on develop and not just in your branch, and that it
was not made flaky by your change.
2. See if that test file was changed recently. If it was, talk to the
person who changed it.
3. If it wasn't changed recently, post a link to gemfire-green-ci.
4. Comment on your pull request with an update on the status of the failing
test in your run.
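
To make the branching explicit, here is a rough sketch of the decision as I
read it from the steps above. It is only an illustration: PrecheckinTriage
and nextSteps are hypothetical names, nothing like this exists in Geode or
in the CI tooling, and it assumes you have already checked that no ticket
exists for the stack trace.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed triage; not part of Geode or its CI. */
final class PrecheckinTriage {

  private PrecheckinTriage() {}

  /**
   * Returns the follow-up actions for a failing test that has no Jira ticket.
   *
   * @param failedAgainOnRetrigger true if the test failed again after re-triggering
   * @param testFileChangedRecently true if the test file was changed recently
   */
  static List<String> nextSteps(boolean failedAgainOnRetrigger,
                                boolean testFileChangedRecently) {
    List<String> steps = new ArrayList<>();

    if (failedAgainOnRetrigger) {
      // A repeat failure is treated as caused by the change under review.
      steps.add("Fix the failure before committing");
      return steps;
    }

    // The test passed on the re-run, so treat it as a flaky test on develop.
    steps.add("Create a Jira ticket with links from the failing CI run");
    if (testFileChangedRecently) {
      steps.add("Talk to the person who changed the test file");
    } else {
      steps.add("Post a link to gemfire-green-ci");
    }
    steps.add("Comment on the pull request with the status of the failing test");
    return steps;
  }
}
```

For a test that passed on the re-run and whose file was not changed
recently, PrecheckinTriage.nextSteps(false, false) returns the ticket step,
the gemfire-green-ci step, and the pull request comment, in that order.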

That is just an idea based on what we are doing for the develop pipeline.
It seems like we are having more failures in PR pipelines than we were
seeing a couple of weeks ago, and some tests that seem to only fail in the
PR pipelines, so it might be time to start tracking some of this stuff. My
main concern with this method is that it might pollute our backlog with
tickets that no one is ever going to look at.

On Mon, Nov 26, 2018 at 10:19 AM Kirk Lund <kl...@apache.org> wrote:

> I just saw SizingFlagDUnitTest fail in a precheckin but it passes on my
> branch when I run directly. I cannot find a Jira ticket for it. What's the
> new process for handling these flickering tests?
>
> See:
> https://concourse.apachegeode-ci.info/builds/17745
>
> Test failure stack:
> org.apache.geode.internal.cache.SizingFlagDUnitTest > testPRHeapLRUDeltaPutOnPrimary FAILED
>     org.apache.geode.test.dunit.RMIException: While invoking org.apache.geode.internal.cache.SizingFlagDUnitTest$12.run in VM 0 running on Host eb7aca4f2587 with 4 VMs
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:433)
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:402)
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:361)
>         at org.apache.geode.internal.cache.SizingFlagDUnitTest.assertValueType(SizingFlagDUnitTest.java:793)
>         at org.apache.geode.internal.cache.SizingFlagDUnitTest.doPRDeltaTestLRU(SizingFlagDUnitTest.java:312)
>         at org.apache.geode.internal.cache.SizingFlagDUnitTest.testPRHeapLRUDeltaPutOnPrimary(SizingFlagDUnitTest.java:220)
>
>         Caused by:
>         org.apache.geode.cache.EntryNotFoundException: Entry not found for key 0
>             at org.apache.geode.internal.cache.LocalRegion.checkEntryNotFound(LocalRegion.java:2760)
>             at org.apache.geode.internal.cache.LocalRegion.nonTXbasicGetValueInVM(LocalRegion.java:3448)
>             at org.apache.geode.internal.cache.LocalRegionDataView.getValueInVM(LocalRegionDataView.java:105)
>             at org.apache.geode.internal.cache.LocalRegion.basicGetValueInVM(LocalRegion.java:3436)
>             at org.apache.geode.internal.cache.LocalRegion.getValueInVM(LocalRegion.java:3424)
>             at org.apache.geode.internal.cache.PartitionedRegionDataStore.getLocalValueInVM(PartitionedRegionDataStore.java:2775)
>             at org.apache.geode.internal.cache.PartitionedRegion.getValueInVM(PartitionedRegion.java:8786)
>             at org.apache.geode.internal.cache.SizingFlagDUnitTest$12.run(SizingFlagDUnitTest.java:797)
>
