For test failures that do not have a ticket for that particular stack trace, you should re-trigger your pre-checkin. If the test fails again, your change probably caused it to start failing, even if it doesn't seem related (like the unit test ordering issue we had the Friday before last), and you would be expected to fix it before committing. If the test passes on your next run, the process is less established. Ideally we would never encounter this case, because it means flaky code was checked in at some point before you branched off of develop, which StressNewTest should catch. We will probably find a few of these because the StressNewTest job was silently failing for a little while. It is working again, but we may have checked in some flaky tests while it was down.
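
To make the two outcomes of a re-trigger concrete, here is a minimal sketch of the decision described above. The names (Triage, triage_unticketed_failure) are hypothetical and not part of any Geode tooling. It assumes you have already confirmed that no ticket exists for the stack trace.

```python
from enum import Enum


class Triage(Enum):
    # Hypothetical labels for the two outcomes described in the paragraph above.
    FIX_BEFORE_COMMIT = "your change probably caused it; fix before committing"
    SUSPECTED_FLAKY = "possibly flaky on develop; process is less established"


def triage_unticketed_failure(rerun_passed: bool) -> Triage:
    """Classify a pre-checkin failure that has no ticket for its stack trace.

    rerun_passed is the result of re-triggering the pre-checkin once.
    """
    if not rerun_passed:
        # Failing twice in a row points at the change, even if it looks unrelated.
        return Triage.FIX_BEFORE_COMMIT
    # Passing on the re-run suggests a flaky test that was already on develop.
    return Triage.SUSPECTED_FLAKY


if __name__ == "__main__":
    print(triage_unticketed_failure(rerun_passed=False).value)
    print(triage_unticketed_failure(rerun_passed=True).value)
```

This only encodes the single re-trigger described here. It says nothing about how many re-runs would be enough to call a test flaky.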

I propose that we follow the CIO process with these. The process would be as follows:

1. Create a Jira ticket for the failure with links to the relevant resources from the failing CI run. Include any evidence that you have towards it being a flaky test that exists on develop and not just in your branch, and evidence that it was not made flaky by your change.
2. See if that test file was changed recently. If it was, talk to the person that changed it.
3. If it wasn't changed recently, post a link to gemfire-green-ci.
4. Comment on your pull request with an update on the status of the failing test in your run.

That is just an idea based on what we are doing for the develop pipeline. It seems like we are having more failures in PR pipelines than we were seeing a couple of weeks ago, and some tests that seem to only fail in the PR pipelines, so it might be time to start tracking some of this stuff. My main concern with this method is that it might pollute our backlog with tickets that no one is ever going to look at.

On Mon, Nov 26, 2018 at 10:19 AM Kirk Lund <kl...@apache.org> wrote:

> I just saw SizingFlagDUnitTest fail in a precheckin but it passes on my
> branch when I run directly. I cannot find a Jira ticket for it. What's the
> new process for handling these flickering tests?
>
> See:
> https://concourse.apachegeode-ci.info/builds/17745
>
> Test failure stack:
> org.apache.geode.internal.cache.SizingFlagDUnitTest > testPRHeapLRUDeltaPutOnPrimary FAILED
>     org.apache.geode.test.dunit.RMIException: While invoking org.apache.geode.internal.cache.SizingFlagDUnitTest$12.run in VM 0 running on Host eb7aca4f2587 with 4 VMs
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:433)
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:402)
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:361)
>         at org.apache.geode.internal.cache.SizingFlagDUnitTest.assertValueType(SizingFlagDUnitTest.java:793)
>         at org.apache.geode.internal.cache.SizingFlagDUnitTest.doPRDeltaTestLRU(SizingFlagDUnitTest.java:312)
>         at org.apache.geode.internal.cache.SizingFlagDUnitTest.testPRHeapLRUDeltaPutOnPrimary(SizingFlagDUnitTest.java:220)
>
>     Caused by:
>     org.apache.geode.cache.EntryNotFoundException: Entry not found for key 0
>         at org.apache.geode.internal.cache.LocalRegion.checkEntryNotFound(LocalRegion.java:2760)
>         at org.apache.geode.internal.cache.LocalRegion.nonTXbasicGetValueInVM(LocalRegion.java:3448)
>         at org.apache.geode.internal.cache.LocalRegionDataView.getValueInVM(LocalRegionDataView.java:105)
>         at org.apache.geode.internal.cache.LocalRegion.basicGetValueInVM(LocalRegion.java:3436)
>         at org.apache.geode.internal.cache.LocalRegion.getValueInVM(LocalRegion.java:3424)
>         at org.apache.geode.internal.cache.PartitionedRegionDataStore.getLocalValueInVM(PartitionedRegionDataStore.java:2775)
>         at org.apache.geode.internal.cache.PartitionedRegion.getValueInVM(PartitionedRegion.java:8786)
>         at org.apache.geode.internal.cache.SizingFlagDUnitTest$12.run(SizingFlagDUnitTest.java:797)