uschindler opened a new issue, #15106: URL: https://github.com/apache/lucene/issues/15106
### Description One user reported and I was able to explain this behaviour: https://github.com/apache/lucene/issues/15068#issuecomment-3210953067 Basically the RefCountedSharedArena throws an `IllegalStateException` when the refcounting is wrong. Tis should only happen when something is wrong, like you close an arena too often. But due to the official API of Arena, throwing IllegalStateException is the worst thing to do, because the documentation exactly specifies when this may happen: > Throws: > - [IllegalStateException](https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/IllegalStateException.html) - if the arena has already been closed > - [IllegalStateException](https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/IllegalStateException.html) - if a segment associated with this arena is being accessed concurrently, e.g. by a downcall method handle .... _addition by Uwe:_ it is enough if any code actually invokes a method on a memory segment in that shared arena while another thread is calling close! Throwing IllegalStateException is perfectly fine theoretically, if and only if the underlying `Scope` is already closed. But we don't check this at all and just throw the exception if the close method was called to often. If now the second documented `IllegalStateException` is thrown by the Arena we delegate to, the refcount was already decremented. Now the "official" way to handle the second case is this spinloop, which is implemented in `MemorySegmentIndexInput.close()`: https://github.com/apache/lucene/blob/cb16075c5fd5c4bbcb0fab561e185fe2e3f8e939/lucene/core/src/java/org/apache/lucene/store/MemorySegmentIndexInput.java#L716-L724 This spinloop checks if the arena's scope is still open and if that is still the case it repeats the close call on `IllegalStateException` _until_ the arena's scope returns closed. This code was negotiated with the OpenJDK list and we also have a multithreaded test for it. Of course this test does not use RefCountedArenas as it uses a test file with a name which is not grouped. Normal case: - MemorySegmentIndexInput code calls close for first time (inner arenas scope still open). If the arena is still open, all is fine and refcount is decremented and inner arena is not touched. All fine What now happens if the inner Arena's close throws that `IllegalStateException`: - MemorySegmentIndexInput calls close on the last open file in that shared group. - The refcount is lowered, no exception thrown - **For bad luck, some other thread acceses an IndexInput** - The inner Arena's close throws IllegalStateException in the final cleanup - MemorySegmentIndexInput (above code) catches exception and enters spinloop - MemorySegmentIndexInput repeats close - The refcount is now already 0 and therefor the RefCountedSharedArena's code throws IllegalStateException (while inner Arena's close is no longer called at all). - MemorySegmentIndexInput (above code) catches exception and enters spinloop - and this repeats forever, the inner arena is never closed Basically to fix this, the following should be done: - The code here should be removed: https://github.com/apache/lucene/blob/b1c1770e8052a74fd30e3520e4fd662e2724bd1c/lucene/core/src/java/org/apache/lucene/store/RefCountedSharedArena.java#L106-L108 - Instead if the refcount reaches 0 it should stay 0 (or go negative, no matter). - The code should run the cleanup again and therefore call inner Arena's close again. If this arena throws IllegalStateException it can be handled by the caller. So basically I'd remove the two parts of the refocunting and change the code to trivially do: - decrement refcount - if refcount>0, do nothing - otherwise call inner Arena's close If for some reason the inner scope is already closed the Arena#close() will throw IllegalState anyways, we do not need to duplicate the refcounting! Severity: The bug is severe, if and only if somebody closes an IndexInput/IndexReader while it is used in another thread. This may lead to a spinloop if the concurrent acces is activcely accessing a segment.. SearcherMananger and normal Lucene use should not be affected, but the bug may still causehalting the JVM due to a infinite loop when something goes wrong on close! @ChrisHegarty ### Version and environment details Affected versions: Lucene 9.12, 10.x, main branch -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org