javanna opened a new pull request, #13542: URL: https://github.com/apache/lucene/pull/13542
I experimented with introducing intra-segment concurrency in Lucene by leveraging the existing `BulkScorer#score` method that takes a range of doc ids as arguments, and switching the searcher to call it to search a partition of a leaf. I introduced a `LeafReaderContextPartition` abstraction that wraps a `LeafReaderContext` and the range of doc ids that the partition targets. A slice now points to specific partitions of segments, identified by their range of doc ids.

I have found a couple of challenging problems (thanks to test failures) that I have solved or worked around as best I could at the time, but I am sure the current solutions are not yet good enough to merge. They work but need refining. They are useful to highlight the problems I encountered:

1) `IndexSearcher#count` / `TotalHitCountCollector` rely on `Weight#count(LeafReaderContext)`, which now gets called multiple times against the same leaf and leads to excessive counting of hits. Resolved by synchronizing the call to `getLeafCollector`, to make sure we don't pull a leaf collector for the same segment multiple times in parallel, and by marking a leaf as early terminated when `getLeafCollector` throws a `CollectionTerminatedException`, so that we skip that same leaf in subsequent calls that refer to it.
2) `LRUQueryCache` caches the return value of `Weight#count`. When we execute the same query against the same segment multiple times (as part of a single search call), the first time we do the actual counting for the docs that the first partition holds, and subsequent times we should do the same: count hits in each partition of the same segment instead of retrieving the count from the cache. The right thing to do is to only get from the cache the first time we execute against a certain segment as part of a certain search operation. Subsequent executions within the same search that target the same segment should not go through the cache. If we did not go through the cache for the first occurrence of a certain segment, it was not cached. This is dealt with in `LRUQueryCache.CachingWrapperWeight#count` for now, by keeping track of which leaf contexts have been seen.
3) `CheckHits` verifies matches and may go outside the bounds of the doc id range of the current slice.

These are more or less the changes I made, step by step:

- Added a `LeafReaderContextPartition` abstraction that holds a `LeafReaderContext` plus the range of doc ids it targets. A slice now points to specific subsets of segments, identified by their corresponding range of doc ids.
- Introduced an additional `IndexSearcher#search` method that takes `LeafReaderContextPartition[]` instead of `LeafReaderContext[]`, which calls `scorer.score(leafCollector, ctx.reader().getLiveDocs(), slice.minDocId, slice.maxDocId);`, providing the range of doc ids, in place of `scorer.score(leafCollector, ctx.reader().getLiveDocs());`, which would score all documents.
- Added an override for the new protected `IndexSearcher#search` to subclasses that require it.
- Introduced wrapping of `LeafReaderContext` at the beginning of each search, to share state between slices that target the same segment (to solve the early termination issues described above).
- Hacked `IndexSearcher#getSlices` and `LuceneTestCase` to generate as many slices as possible, as well as to set an executor as often as possible in tests, in order to get test coverage and find issues.

There are still a couple of rough edges: some test failures need to be investigated to see whether they are test problems or actual bugs. I hacked the way we generate slices to produce single-document slices, but that produces far too many slices in some situations, which causes test failures due to OOM or too much cloning.

I am looking for early feedback: does the technical approach make sense? Would you do it entirely differently? How do we solve the problems I encountered described above? Can I get help on tests that need investigation?
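To make the partitioning idea and the over-counting issue from problem 1 concrete, here is a minimal, Lucene-free sketch. All names in it (`PartitionSketch`, `DocIdRange`, `partition`, `countInRange`, `countWholeSegment`) are hypothetical and are not part of this PR or of Lucene; a `BitSet` stands in for the docs that match a query within one segment. It only illustrates why a segment-level count that is invoked once per partition over-counts, while a range-aware count does not.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class PartitionSketch {

  /** Hypothetical stand-in for a partition of one segment: doc ids in [minDocId, maxDocId). */
  static final class DocIdRange {
    final int minDocId; // inclusive
    final int maxDocId; // exclusive

    DocIdRange(int minDocId, int maxDocId) {
      this.minDocId = minDocId;
      this.maxDocId = maxDocId;
    }
  }

  /** Splits [0, maxDoc) into at most numPartitions contiguous, non-overlapping ranges. */
  static List<DocIdRange> partition(int maxDoc, int numPartitions) {
    if (numPartitions < 1) {
      throw new IllegalArgumentException("numPartitions must be >= 1");
    }
    List<DocIdRange> ranges = new ArrayList<>();
    // Ceiling division, with a floor of one doc per range.
    int size = Math.max(1, (maxDoc + numPartitions - 1) / numPartitions);
    for (int min = 0; min < maxDoc; min += size) {
      ranges.add(new DocIdRange(min, Math.min(maxDoc, min + size)));
    }
    return ranges;
  }

  /** Range-aware count: only visits matches inside the given doc id range. */
  static int countInRange(BitSet matches, DocIdRange range) {
    int count = 0;
    for (int doc = matches.nextSetBit(range.minDocId);
        doc >= 0 && doc < range.maxDocId;
        doc = matches.nextSetBit(doc + 1)) {
      count++;
    }
    return count;
  }

  /** Segment-level count that ignores any partition boundaries. */
  static int countWholeSegment(BitSet matches) {
    return matches.cardinality();
  }

  public static void main(String[] args) {
    BitSet matches = new BitSet(10);
    matches.set(1);
    matches.set(4);
    matches.set(5);
    matches.set(9);

    int rangeAwareTotal = 0;
    int repeatedSegmentTotal = 0;
    for (DocIdRange range : partition(10, 3)) {
      rangeAwareTotal += countInRange(matches, range);
      // Calling the segment-level count once per partition counts every hit again.
      repeatedSegmentTotal += countWholeSegment(matches);
    }
    System.out.println("range-aware total: " + rangeAwareTotal);
    System.out.println("segment-level count summed per partition: " + repeatedSegmentTotal);
  }
}
```

With 10 docs split into 3 ranges and 4 matching docs, the range-aware total is 4, whereas summing the segment-level count once per partition yields 12.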
Relates to #9721