sgup432 opened a new issue, #14999: URL: https://github.com/apache/lucene/issues/14999
### Description

During a recent issue in OpenSearch, we observed high wait times (>50 ms) on `IndexReader.registerParentReader` in both the query and fetch phases. At the time, each OpenSearchLeafReader had more than 30k parent readers, and with many leaf readers (i.e. segments) per Lucene index, this caused significant latency for search requests, especially at the tail.

Each IndexReader tracks its parents in a synchronized set that holds weak references. These weak references are cleared during subsequent `synchronizedSet.add` calls once the GC has reclaimed the associated objects. During this incident the set had grown beyond 30k entries, and one thread was holding the lock while cleaning up stale references. As a result, other threads calling `synchronizedSet.add` were blocked, leading to contention and delay. A sample lock profile is below; ~33% of the time was spent trying to add elements to the set.

<img width="1486" height="154" alt="Image" src="https://github.com/user-attachments/assets/3d8380c4-e1e7-4133-8ed7-7d7bfde92636" />

**More context**: In OpenSearch, each query and fetch phase maintains a per-request context by wrapping the underlying reader in a NonClosingReaderWrapper. This ensures that plugins can safely wrap the reader without risking accidental closure of the underlying resource. For every request on a shard, multiple short-lived NonClosingReaderWrapper instances (along with other plugin-provided IndexReaderWrappers) are created and registered as parents of the OpenSearchLeafReader. This leads to lock contention in `synchronizedSet.add`, especially while stale entries in the parentReaders set are being cleared. Under such conditions, lock wait times can exceed 100 ms, depending on the size of the parentReaders set. The issue becomes more pronounced under high throughput per shard, where the frequency of reader registration increases significantly.
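For readers unfamiliar with the structure involved, the tracking described above can be modeled roughly as follows (a minimal sketch, assuming the parent set is a synchronized set backed by a `WeakHashMap`, as in current Lucene; `ParentTrackingModel` and its members are illustrative names, not Lucene API). Every `add()` takes the same monitor, and that monitor is also held while the backing `WeakHashMap` expunges stale entries, which is where the contention in the lock profile comes from:

```java
import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

public class ParentTrackingModel {
    // Weak-keyed set: entries vanish once the GC reclaims a registered reader.
    // All operations share one monitor via synchronizedSet.
    static final Set<Object> parentReaders =
        Collections.synchronizedSet(Collections.newSetFromMap(new WeakHashMap<>()));

    // Stand-in for IndexReader.registerParentReader(reader).
    static void registerParentReader(Object reader) {
        parentReaders.add(reader); // blocks while another thread holds the monitor
    }

    public static void main(String[] args) {
        // Strong references keep all entries alive, mimicking the 30k-entry set.
        Object[] strongRefs = new Object[30_000];
        for (int i = 0; i < strongRefs.length; i++) {
            strongRefs[i] = new Object();
            registerParentReader(strongRefs[i]);
        }
        System.out.println(parentReaders.size()); // prints 30000 while refs are held
    }
}
```

Once the strong references are dropped and a GC runs, those 30k entries become stale and must be expunged under the same lock that every concurrent `registerParentReader` call needs.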
**Is it a Lucene issue?**

I understand that one could argue this is primarily an OpenSearch issue, specifically in how it creates and manages per-request reader wrappers. However, I believe Lucene should offer a basic mechanism at the IndexReader level to exclude certain instances from being tracked as parents, or alternatively improve its current approach: the existing use of a synchronizedSet with weak references can lead to performance bottlenecks in such cases. Curious to know what the Lucene folks think about this.

**Some solutions I can think of are below.** Open to any other ideas.

1. Provide a way to not track certain index readers as parents. Any short-lived reader wrapper could then return `false` from `trackAsParent()` and avoid being tracked. I did a simple implementation of this, and it improved p99 latency by up to ~30-40% in some cases.

```java
class IndexReader {
  public boolean trackAsParent() {
    return true;
  }

  public final void registerParentReader(IndexReader reader) {
    ensureOpen();
    if (reader.trackAsParent()) {
      parentReaders.add(reader);
    }
  }
}
```

2. If not the above, consider making `registerParentReader` non-final so its behavior can be overridden.

3. Or provide an explicit way to remove a parent reader from the set, so that once short-lived reader wrappers are closed, we can explicitly remove them immediately and not let the synchronizedSet grow in size. This might still have some contention, but that would have to be tested.
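To make option 3 concrete, here is a hedged sketch of what an explicit unregister hook could look like. `unregisterParentReader` is a proposed, hypothetical method (not existing Lucene API), and `ReaderSketch` is a simplified stand-in for IndexReader; the idea is that a short-lived wrapper's close path would call it on its child so the parent set stays small instead of waiting for the GC to clear weak references:

```java
import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

public class ReaderSketch {
    // Same weak-keyed synchronized set Lucene uses for parent tracking.
    final Set<ReaderSketch> parentReaders =
        Collections.synchronizedSet(Collections.newSetFromMap(new WeakHashMap<>()));

    public void registerParentReader(ReaderSketch reader) {
        parentReaders.add(reader);
    }

    // Proposed addition: a wrapper's doClose() would invoke
    // child.unregisterParentReader(this) to drop itself eagerly.
    public void unregisterParentReader(ReaderSketch reader) {
        parentReaders.remove(reader);
    }
}
```

Eager removal still contends on the same monitor, but it bounds the set size to the number of live wrappers, so stale-entry expunging under the lock no longer scans tens of thousands of entries.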