[PR] WIP: draft of intra segment concurrency [lucene]

via GitHub Thu, 04 Jul 2024 07:47:03 -0700


javanna opened a new pull request, #13542:
URL: https://github.com/apache/lucene/pull/13542


   I experimented with introducing intra-segment concurrency in Lucene by 
leveraging the existing `BulkScorer#score` method that takes a range of doc ids 
as arguments, and switching the searcher to call it to search a partition of a 
leaf. I introduced a `LeafReaderContextPartition` abstraction that wraps 
`LeafReaderContext` and the range of doc ids that the partition targets. A 
slice now points to specific partitions of segments, identified by their range 
of doc ids.
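
   As a rough illustration of the partitioning idea, here is a minimal 
sketch that splits one segment into contiguous doc id ranges. `PartitionSketch` 
and `split` are hypothetical names, not the PR's `LeafReaderContextPartition`, 
and the exclusive upper bound is an assumption chosen to mirror the min/max 
arguments of the range-based `score` call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a segment partition: a leaf ordinal plus a doc id range. */
final class PartitionSketch {
    final int leafOrd;   // which segment (leaf) this partition belongs to
    final int minDocId;  // inclusive lower bound, relative to the segment
    final int maxDocId;  // exclusive upper bound (assumption), relative to the segment

    PartitionSketch(int leafOrd, int minDocId, int maxDocId) {
        this.leafOrd = leafOrd;
        this.minDocId = minDocId;
        this.maxDocId = maxDocId;
    }

    /** Splits a segment of maxDoc documents into at most numPartitions contiguous ranges. */
    static List<PartitionSketch> split(int leafOrd, int maxDoc, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<PartitionSketch> partitions = new ArrayList<>();
        int size = (maxDoc + numPartitions - 1) / numPartitions; // ceiling division
        for (int min = 0; min < maxDoc; min += size) {
            partitions.add(new PartitionSketch(leafOrd, min, Math.min(min + size, maxDoc)));
        }
        return partitions;
    }
}
```

   The ranges do not overlap and together cover every doc id of the segment, 
which is the property the searcher relies on when each partition is searched 
independently.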
   
   I ran into a few challenging problems (surfaced by test failures) that I 
solved or worked around as best I could at the time, but the current solutions 
are not yet good enough to merge. They work but need refining, and they are 
useful for highlighting the problems I encountered:
   
   1) `IndexSearcher#count` / `TotalHitCountCollector` rely on 
`Weight#count(LeafReaderContext)`, which now gets called multiple times against 
the same leaf and leads to overcounting of hits.
   Resolved by synchronizing the call to `getLeafCollector`, to make sure we 
don't pull a leaf collector for the same segment multiple times in parallel, 
and by marking a leaf as early terminated when `getLeafCollector` throws a 
`CollectionTerminatedException`, so that we skip that same leaf in subsequent 
calls that refer to it.
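
   To make the overcounting concrete, here is a minimal sketch of the shared 
per-segment state described above. `SharedLeafSketch` and its methods are 
hypothetical and do not use Lucene classes; the whole-segment count stands in 
for what `Weight#count` returns, and setting the flag stands in for the point 
where `CollectionTerminatedException` would be thrown.

```java
/** Hypothetical per-segment state shared by all partitions of that segment. */
final class SharedLeafSketch {
    private final int wholeSegmentCount; // stand-in for a Weight#count-style shortcut
    private boolean earlyTerminated;     // set once the whole segment has been counted

    SharedLeafSketch(int wholeSegmentCount) {
        this.wholeSegmentCount = wholeSegmentCount;
    }

    /** Without shared state, every partition adds the count of the entire segment. */
    int collectPartitionNaive() {
        return wholeSegmentCount;
    }

    /** Synchronized so that two partitions cannot both apply the shortcut in parallel. */
    synchronized int collectPartitionGuarded() {
        if (earlyTerminated) {
            return 0; // an earlier partition already accounted for the whole segment
        }
        earlyTerminated = true; // later partitions of this segment are skipped
        return wholeSegmentCount;
    }

    /** Sums the contributions of numPartitions partitions of a single segment. */
    static int total(int wholeSegmentCount, int numPartitions, boolean guarded) {
        SharedLeafSketch leaf = new SharedLeafSketch(wholeSegmentCount);
        int total = 0;
        for (int i = 0; i < numPartitions; i++) {
            total += guarded ? leaf.collectPartitionGuarded() : leaf.collectPartitionNaive();
        }
        return total;
    }
}
```

   With 100 matching docs and 4 partitions, the naive variant reports 400 hits 
while the guarded variant reports 100.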
   
   2) `LRUQueryCache` caches the return value of `Weight#count`. When we 
execute the same query against the same segment multiple times (as part of a 
single search call), the first execution counts the docs that the first 
partition holds, and subsequent executions should do the same for their own 
partition of the segment instead of retrieving the count from the cache. The 
right thing to do is to read from the cache only the first time we execute 
against a certain segment as part of a certain search operation. Subsequent 
executions within the same search that target the same segment should not go 
through the cache. If the first occurrence of a certain segment was not served 
from the cache, the count was not cached. This is dealt with in 
`LRUQueryCache.CachingWrapperWeight#count` for now, by keeping track of which 
leaf contexts have been seen.
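
   The "consult the cache only on the first visit" rule could be sketched as 
follows. `CountCacheSketch`, `SearchScope`, and the string segment keys are 
hypothetical and unrelated to the real `LRUQueryCache` internals; returning -1 
follows the `Weight#count` convention for "count not available, do the actual 
counting". Skipping later partitions after a cache hit is the job of the early 
termination handling from problem 1 and is not modelled here.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical count cache keyed by segment, shared across searches. */
final class CountCacheSketch {
    private final Map<String, Integer> cachedCounts = new ConcurrentHashMap<>();

    void put(String segmentKey, int wholeSegmentCount) {
        cachedCounts.put(segmentKey, wholeSegmentCount);
    }

    /** Creates the per-search state that remembers which segments were already seen. */
    SearchScope newSearch() {
        return new SearchScope();
    }

    final class SearchScope {
        private final Set<String> seenSegments = new HashSet<>();

        /** Returns the cached whole-segment count on the first visit only, otherwise -1. */
        synchronized int count(String segmentKey) {
            boolean firstVisit = seenSegments.add(segmentKey);
            if (!firstVisit) {
                return -1; // later partitions of the same segment bypass the cache
            }
            Integer cached = cachedCounts.get(segmentKey);
            return cached == null ? -1 : cached;
        }
    }
}
```

   Each search gets its own `SearchScope`, so a cached count stays usable for 
the first partition of every later search.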
   
   3) `CheckHits` verifies matches and may go outside the bounds of the doc id 
range of the current slice.
   
   
   These are more or less the changes I made step by step:
   - Added a `LeafReaderContextPartition` abstraction that holds a 
`LeafReaderContext` plus the range of doc ids it targets. A slice now points to 
specific subsets of segments, identified by their corresponding range of doc 
ids.
   - Introduced an additional `IndexSearcher#search` method that takes 
`LeafReaderContextPartition[]` instead of `LeafReaderContext[]`, and calls 
`scorer.score(leafCollector, ctx.reader().getLiveDocs(), slice.minDocId, 
slice.maxDocId);` providing the range of doc ids, in place of 
`scorer.score(leafCollector, ctx.reader().getLiveDocs());` which would score 
all documents
   - Added override for new protected IndexSearcher#search to subclasses that 
require it
   - Introduced wrapping of `LeafReaderContext` at the beginning of each 
search, to share state between slices that target the same segment (to solve 
the early termination issues described above)
   - Hacked `IndexSearcher#getSlices` and `LuceneTestCase` to generate as many 
slices as possible, as well as set an executor as often as possible in tests, 
in order to get test coverage and find issues
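
   The difference between scoring a whole segment and scoring one partition 
can be sketched as below. `RangeScoreSketch` is hypothetical and only counts 
hits; the `IntPredicate` stands in for the query, the `boolean[]` stands in for 
the live docs, and the upper bound is assumed to be exclusive.

```java
import java.util.function.IntPredicate;

/** Hypothetical range-restricted scoring loop over one segment. */
final class RangeScoreSketch {
    /** Counts matching live docs with ids in [minDocId, maxDocId). */
    static int scoreRange(IntPredicate matches, boolean[] liveDocs, int minDocId, int maxDocId) {
        int hits = 0;
        for (int doc = minDocId; doc < maxDocId; doc++) {
            // A null liveDocs means the segment has no deletions.
            boolean live = liveDocs == null || liveDocs[doc];
            if (live && matches.test(doc)) {
                hits++;
            }
        }
        return hits;
    }
}
```

   Summing `scoreRange` over non-overlapping partitions that cover the segment 
gives the same hit count as a single call over the whole doc id range.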
   
   
   There are still a few rough edges: some test failures need to be 
investigated to see whether they are test problems or actual bugs. I hacked the 
way we generate slices to produce single-document slices, but in some 
situations that produces far too many slices, which causes test failures due 
to OOM or too much cloning.
   
   
   
   I am looking for early feedback: does the technical approach make sense? 
Would you do it entirely differently? How do we solve the problems I 
encountered described above? Can I get help on tests that need investigation?
   
   Relates to #9721
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

