aoli-al opened a new issue, #13552: URL: https://github.com/apache/lucene/issues/13552
### Description I saw a flaky test, `TestIndexWriterWithThreads#testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce` caused by concurrency issues recently: ``` MockDirectoryWrapper: cannot close: there are still 30 open files: {_1.tvd=1, _0_Lucene99_0.pos=1, _1.nvd=1, _0.nvd=1, _2.fdt=1, _2.tvx=1, _0_Lucene99_0.tim=1, _0_Lucene99_0.tip=1, _2.fdx=1, _2_Lucene99_0.doc=1, _2.nvd=1, _2.tvd=1, _0_Lucene90_0.dvd=1, _1.fdx=1, _1_Lucene99_0.doc=1, _1.fdt=1, _1.tvx=1, _2_Lucene99_0.pos=1, _1_Lucene90_0.dvd=1, _0_Lucene99_0.doc=1, _0.fdx=1, _2_Lucene99_0.tip=1, _0.fdt=1, _0.tvx=1, _2_Lucene99_0.tim=1, _0.tvd=1, _1_Lucene99_0.pos=1, _2_Lucene90_0.dvd=1, _1_Lucene99_0.tip=1, _1_Lucene99_0.tim=1} java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still 30 open files: {_1.tvd=1, _0_Lucene99_0.pos=1, _1.nvd=1, _0.nvd=1, _2.fdt=1, _2.tvx=1, _0_Lucene99_0.tim=1, _0_Lucene99_0.tip=1, _2.fdx=1, _2_Lucene99_0.doc=1, _2.nvd=1, _2.tvd=1, _0_Lucene90_0.dvd=1, _1.fdx=1, _1_Lucene99_0.doc=1, _1.fdt=1, _1.tvx=1, _2_Lucene99_0.pos=1, _1_Lucene90_0.dvd=1, _0_Lucene99_0.doc=1, _0.fdx=1, _2_Lucene99_0.tip=1, _0.fdt=1, _0.tvx=1, _2_Lucene99_0.tim=1, _0.tvd=1, _1_Lucene99_0.pos=1, _2_Lucene90_0.dvd=1, _1_Lucene99_0.tip=1, _1_Lucene99_0.tim=1} at __randomizedtesting.SeedInfo.seed([4D88B1144F0B21A3:1E330D289AB0246]:0) at org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:876) at org.apache.lucene.index.TestIndexWriterWithThreads._testMultipleThreadsFailure(TestIndexWriterWithThreads.java:349) at org.apache.lucene.index.TestIndexWriterWithThreads.testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce(TestIndexWriterWithThreads.java:502) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) at java.base/java.lang.reflect.Method.invoke(Method.java:580) at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758) at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946) at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982) at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996) at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48) at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45) at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60) at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390) at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843) at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490) at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955) at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840) at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891) at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902) at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38) at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53) at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44) at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60) at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390) at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850) at java.base/java.lang.Thread.run(Thread.java:1583) Caused by: java.lang.RuntimeException: unclosed IndexInput: _2_Lucene99_0.tim at org.apache.lucene.tests.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:783) at org.apache.lucene.tests.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:835) at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsReader.<init>(Lucene90BlockTreeTermsReader.java:136) at org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.fieldsProducer(Lucene99PostingsFormat.java:428) at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:328) at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:390) at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:98) at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95) at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:179) at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:286) at org.apache.lucene.index.IndexWriter.openSegmentStates(IndexWriter.java:6481) at org.apache.lucene.index.IndexWriter.forceApply(IndexWriter.java:6255) at org.apache.lucene.index.IndexWriter.tryApply(IndexWriter.java:6189) at org.apache.lucene.index.IndexWriter.lambda$publishFrozenUpdates$12(IndexWriter.java:2814) at org.apache.lucene.index.IndexWriter$EventQueue.processEventsInternal(IndexWriter.java:336) at org.apache.lucene.index.IndexWriter$EventQueue.processEvents(IndexWriter.java:325) at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:6032) at org.apache.lucene.index.IndexWriter.maybeProcessEvents(IndexWriter.java:6025) at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1552) at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1843) at org.apache.lucene.index.TestIndexWriterWithThreads$IndexerThread.run(TestIndexWriterWithThreads.java:98) unclosed IndexInput: _2_Lucene99_0.tim java.lang.RuntimeException: unclosed IndexInput: _2_Lucene99_0.tim at org.apache.lucene.tests.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:783) at org.apache.lucene.tests.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:835) at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsReader.<init>(Lucene90BlockTreeTermsReader.java:136) at org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat.fieldsProducer(Lucene99PostingsFormat.java:428) at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:328) at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:390) at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:98) at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95) at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:179) at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:286) at org.apache.lucene.index.IndexWriter.openSegmentStates(IndexWriter.java:6481) at org.apache.lucene.index.IndexWriter.forceApply(IndexWriter.java:6255) at org.apache.lucene.index.IndexWriter.tryApply(IndexWriter.java:6189) at org.apache.lucene.index.IndexWriter.lambda$publishFrozenUpdates$12(IndexWriter.java:2814) at org.apache.lucene.index.IndexWriter$EventQueue.processEventsInternal(IndexWriter.java:336) at org.apache.lucene.index.IndexWriter$EventQueue.processEvents(IndexWriter.java:325) at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:6032) at org.apache.lucene.index.IndexWriter.maybeProcessEvents(IndexWriter.java:6025) at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1552) at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1843) at org.apache.lucene.index.TestIndexWriterWithThreads$IndexerThread.run(TestIndexWriterWithThreads.java:98) NOTE: reproduce with: gradlew test --tests TestIndexWriterWithThreads.testIOExceptionDuringWriteSegmentWithThreadsOnlyOnce -Dtests.seed=4D88B1144F0B21A3 -Dtests.locale=gv-Latn-IM -Dtests.timezone=America/Noronha -Dtests.asserts=true -Dtests.file.encoding=UTF-8 NOTE: test params are: codec=LZ4WithPresetCompressingStoredFieldsData(storedFieldsFormat=Lucene90CompressingStoredFieldsFormat(compressionMode=BEST_SPEED, chunkSize=24, maxDocsPerChunk=9, blockShift=7), termVectorsFormat=Lucene90CompressingTermVectorsFormat(compressionMode=BEST_SPEED, chunkSize=24, maxDocsPerChunk=9, blockSize=7)), sim=Asserting(RandomSimilarity(queryNorm=true): {field=IB LL-D2}), locale=gv-Latn-IM, timezone=America/Noronha NOTE: Mac OS X 14.5 aarch64/Oracle Corporation 21.0.2 (64-bit)/cpus=1,threads=1,free=229604664,total=257425408 NOTE: All tests run in this JVM: [TestIndexWriterWithThreads] ``` I could not identify the root cause of the failure, but luckily, I found a way to make the failure more deterministic. The test is still flaky, and you may run it multiple times to see the failure. ``` diff --git a/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java b/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java index 44d8bee8460..ace128a6cf8 100644 --- a/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java +++ b/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java @@ -1560,6 +1560,9 @@ public class IndexWriter if (infoStream.isEnabled("IW")) { infoStream.message("IW", "hit exception updating document"); } + try { + Thread.sleep(1000); + } catch (Exception e) {} maybeCloseOnTragicEvent(); } } @@ -2498,6 +2501,9 @@ public class IndexWriter docWriter.close(); // mark it as closed first to prevent subsequent indexing actions/flushes assert !Thread.holdsLock(this) : "IndexWriter lock should never be hold when aborting"; + try { + Thread.sleep(5000); + } catch (Exception e) {} docWriter.abort(); // don't sync on IW here docWriter.flushControl.waitForFlush(); // wait for all concurrently running flushes publishFlushedSegments( @@ -4944,6 +4950,10 @@ public class IndexWriter assert merge.registerDone; assert merge.maxNumSegments == UNBOUNDED_MAX_MERGE_SEGMENTS || merge.maxNumSegments > 0; + try { + Thread.sleep(1000); + } catch (Exception e) { + } if (tragedy.get() != null) { throw new IllegalStateException( "this writer hit an unrecoverable error; cannot merge", tragedy.get()); diff --git a/lucene/core/src/test/org/apache/lucene/index/TestIndexWriterWithThreads.java b/lucene/core/src/test/org/apache/lucene/index/TestIndexWriterWithThreads.java index 65dc37b50e5..b90d93a117d 100644 --- a/lucene/core/src/test/org/apache/lucene/index/TestIndexWriterWithThreads.java +++ b/lucene/core/src/test/org/apache/lucene/index/TestIndexWriterWithThreads.java @@ -270,9 +270,9 @@ public class TestIndexWriterWithThreads extends LuceneTestCase { // failure to trigger an IOException public void _testMultipleThreadsFailure(MockDirectoryWrapper.Failure failure) throws Exception { - int NUM_THREADS = 3; + int NUM_THREADS = 1; - for (int iter = 0; iter < 2; iter++) { + for (int iter = 0; iter < 1; iter++) { if (VERBOSE) { System.out.println("TEST: iter=" + iter); } @@ -296,6 +296,9 @@ public class TestIndexWriterWithThreads extends LuceneTestCase { } syncStart.await(); + try { + Thread.sleep(500); + } catch (Exception e) {} dir.failOn(failure); failure.setDoFail(); ``` Some observations: - the failure only happens if Merge threads are created, which requires `dir.failOn(failure);` to be executed after IndexWriter thread calls `updateDocuments`. - The merge threads should call the `IndexWriter.shouldClose` method. This requires the writer thread to wait a bit before calling `maybeCloseOnTragicEvent` at https://github.com/apache/lucene/blob/3304b60c9cc2fd23531263ea9d95dc3e92c0b2ce/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L1563 - This allows the merge thread to call `maybeCloseOnTragicEvent();` while the IW threads continue. - Finally, the testing thread closes the `MockDirectoryWrapper` while the Merge thread is still inside `maybeCloseOnTragicEvent`. This may be related to #10112, #10930, and #4890 ### Version and environment details My lucene commit: 33a4c1d8ef999902dacedde9c7f04a3c7e2e78c9 JDK version: openjdk 21.0.3 2024-04-16 OpenJDK Runtime Environment Homebrew (build 21.0.3) OpenJDK 64-Bit Server VM Homebrew (build 21.0.3, mixed mode, sharing) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org