Re: [I] Improve Lucene's I/O concurrency [lucene]

via GitHub Tue, 04 Jun 2024 13:57:55 -0700


sohami commented on issue #13179:
URL: https://github.com/apache/lucene/issues/13179#issuecomment-2148405028


   > @sohami I gave a try at a possible approach at #13450 in case you're 
curious.
   
   @jpountz Thanks for sharing this. Originally I was thinking of the prefetch 
optimization only in the collect phase, but I am trying to understand whether 
it can also be used on the iterator side. To understand this better, I am 
looking at the `SortedNumericDocValuesRangeQuery` test to follow the flow when 
different iterators are involved.
   
   So far my general understanding is that all scoring and collection of docs 
via collectors happens in the method `DefaultBulkScorer::score`. The lead 
`scorerIterator` there can be a standalone iterator, a wrapper over multiple 
iterators, or an `approximation` iterator when the `TwoPhaseIterator` is 
non-null. These are then passed down to `scoreAll` or `scoreRange` (ignoring 
the `competitiveIterator` for now). In both `scoreAll` and `scoreRange` we 
iterate over the lead `scorerIterator` to get candidate docs and then, if a 
`TwoPhaseIterator` is present, check whether each doc matches it before the 
doc becomes eligible for collection. So I see the following flows/cases: 
a) only the lead `scorerIterator` is present, b) both the lead 
`scorerIterator` and a `TwoPhaseIterator` are present, c) the collect phase, 
which runs over the docs that the scorers have found.
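
   For illustration, the flow described above can be sketched roughly as 
below. This is a simplified, self-contained stand-in and not Lucene's actual 
code: `ArrayIterator`, the `IntPredicate` used as the second-phase check, and 
the list used in place of a collector are all assumptions made so the example 
runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class ScoreLoopSketch {
    // Sentinel marking iterator exhaustion (mirrors the usual convention).
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical lead iterator over a sorted array of doc IDs. */
    static final class ArrayIterator {
        private final int[] docs;
        private int idx = -1;

        ArrayIterator(int... docs) {
            this.docs = docs;
        }

        int nextDoc() {
            idx++;
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }
    }

    /**
     * Case (a): twoPhaseMatches is null, every doc from the lead is collected.
     * Case (b): each doc from the lead must also pass the second-phase check.
     * Case (c): "collection" is modelled as appending to a list.
     */
    static List<Integer> scoreAll(ArrayIterator lead, IntPredicate twoPhaseMatches) {
        List<Integer> collected = new ArrayList<>();
        for (int doc = lead.nextDoc(); doc != NO_MORE_DOCS; doc = lead.nextDoc()) {
            if (twoPhaseMatches == null || twoPhaseMatches.test(doc)) {
                collected.add(doc);
            }
        }
        return collected;
    }
}
```

   In this sketch, verification and collection happen one doc at a time, so 
nothing downstream of the lead iterator knows about upcoming docs in advance.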
   
   Based on the above understanding, here is what I am thinking; I would love 
your feedback:
   
   1. For case (a), when only a single iterator is involved, the `readAhead` 
mechanism can be useful, since a single iterator does not know what the next 
match is until it advances to the next doc.
   
   2. For case (b), we can potentially combine `readAhead` and `prefetch`. We 
can use `readAhead` on the lead iterator and buffer some of the matching docs 
it produces. Then, before evaluating whether these docs match the 
`TwoPhaseIterator`, we can prefetch data for the buffered docs (via some 
`prepareMatches` mechanism on `TwoPhaseIterator`). At that point we know 
exactly which docs will be evaluated against the `TwoPhaseIterator`, so we 
should be able to prefetch data for them. I would like to better understand 
your earlier feedback on this (quoted below), since my understanding is that 
collection comes afterwards.
   > maybe we buffer the next few doc IDs from the first-phase scorer and 
prefetch those
   
   >> FWIW this would break a few things, e.g. we have collectors that only 
compute the score when needed (e.g. when sorting by field then score). But if 
we need to buffer docs up-front, then we don't know at this point in time if 
scores are going to be needed or not, so we need to score more docs. Maybe it's 
still the right trade-off, I'm mostly pointing out that this would be a bigger 
trade-off than what we've done for prefetching until now.
   
   3. Before calling the collect phase on collectors, we can first buffer the 
matching docs, then ask the collectors to trigger an optional `prefetch` for 
the docs that will be passed to them for collection. These docs are the ones 
produced by the scorers, with or without a `TwoPhaseIterator` in the mix.
   
   I think for scenarios like 2 and 3 above, where we know the exact docs that 
will be evaluated or collected, performing `prefetch` could be more useful 
than `readAhead`.
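
   As a rough sketch of the buffering idea in 2 (the same shape would apply to 
3, with the collector in place of the second-phase check): fill a small buffer 
from the lead iterator, announce the whole batch so I/O can be started for it, 
and only then verify each doc. Everything here is hypothetical: 
`PrefetchingTwoPhase`, its `prepareMatches` hook, and `scoreBuffered` are 
illustrative names and not existing Lucene APIs, and the lead iterator is 
modelled as a plain sorted array.

```java
import java.util.ArrayList;
import java.util.List;

final class BufferedPrefetchSketch {

    /** Hypothetical second-phase check with a batch prefetch hook. */
    interface PrefetchingTwoPhase {
        /** Hint that docs[0..count) are about to be verified. */
        void prepareMatches(int[] docs, int count);

        /** Second-phase verification of a single candidate doc. */
        boolean matches(int doc);
    }

    static List<Integer> scoreBuffered(
            int[] leadDocs, PrefetchingTwoPhase twoPhase, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        List<Integer> collected = new ArrayList<>();
        int[] buffer = new int[bufferSize];
        int pos = 0;
        while (pos < leadDocs.length) {
            // 1. Buffer the next few candidates from the lead iterator.
            int count = 0;
            while (count < bufferSize && pos < leadDocs.length) {
                buffer[count++] = leadDocs[pos++];
            }
            // 2. Announce the batch so data for it can be prefetched.
            twoPhase.prepareMatches(buffer, count);
            // 3. Verify each buffered doc and collect the ones that match.
            for (int i = 0; i < count; i++) {
                if (twoPhase.matches(buffer[i])) {
                    collected.add(buffer[i]);
                }
            }
        }
        return collected;
    }
}
```

   Note that this sketch does not address the concern quoted above: once docs 
are buffered ahead of collection, a collector that scores lazily can no longer 
decide per doc whether a score is needed.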


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


