Re: [I] Improve Lucene's I/O concurrency [lucene]

via GitHub Wed, 05 Jun 2024 19:44:30 -0700


sohami commented on issue #13179:
URL: https://github.com/apache/lucene/issues/13179#issuecomment-2151313696


   > > Then before evaluating if these docs matches TwoPhaseIterator or not, we 
can perform prefetch on these buffered docs (via some prepareMatches mechanism 
on TwoPhaseIterator).
   > 
   > This can be done, but I'd note that this would be a significant change to 
our APIs since `TwoPhaseIterator` only supports verifying the current document 
that the approximation is on. It is not possible to buffer matching documents 
from the approximation and then check them with the `TwoPhaseIterator`. This is 
similar to the point I was making in a previous comment about buffering 
documents in collectors: `Scorer#score` only supports scoring the current 
document that the scorer is positioned on, so it is not possible to buffer 
several documents and then evaluate their scores in `TopScoreDocCollector` 
(without API changes).
   
   Thanks for explaining this to me. It seems this would mean changing the 
iteration and scoring behavior to work on a range of docs rather than one doc 
at a time (which is the current behavior in Lucene). It would probably work 
fine for collectors that do not require any scoring, but that is not the 
general use case, and it would be limited to exact prefetch in collectors only.
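
   To make the one-doc-at-a-time constraint concrete, here is a minimal sketch. 
`TwoPhaseSketch` is a hypothetical, simplified stand-in and not Lucene's 
`TwoPhaseIterator`; it only mirrors the contract described above, where 
`matches()` can verify nothing but the document the approximation is currently 
positioned on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in, NOT Lucene's TwoPhaseIterator. It only mirrors the
// contract that matches() applies to the approximation's current document.
final class TwoPhaseSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] candidates;      // sorted doc IDs from the cheap approximation
    private final IntPredicate verifier; // costly per-document check
    private int pos = -1;

    TwoPhaseSketch(int[] candidates, IntPredicate verifier) {
        this.candidates = candidates;
        this.verifier = verifier;
    }

    // Advances the approximation to the next candidate document.
    int nextDoc() {
        pos++;
        return pos < candidates.length ? candidates[pos] : NO_MORE_DOCS;
    }

    // Verifies only the document the approximation is currently positioned on.
    boolean matches() {
        if (pos < 0 || pos >= candidates.length) {
            throw new IllegalStateException("not positioned on a document");
        }
        return verifier.test(candidates[pos]);
    }

    // Doc-at-a-time consumption: each candidate is verified before advancing,
    // so there is no point at which a batch of candidates could be prefetched.
    static List<Integer> collect(TwoPhaseSketch it) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            if (it.matches()) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

   Buffering candidates first and verifying them later would need a way to 
call the verifier for a document other than the current one, which is the API 
change being discussed.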
   
   > > via some prepareMatches mechanism on TwoPhaseIterator
   > 
   > FWIW one thing that is on my mind is that both postings and doc values 
take in the order of 1 or 2 bytes per document. So even a query that matches 
0.1% of docs, evenly distributed in the doc ID space, would still end up 
fetching all pages in practice. So a very smart prefetching may only perform 
better than naive prefetching in the following cases:
   > 
   > * Queries that are _extremely_ sparse.
   > * Queries whose matches are highly clustered in the doc ID space, because 
of index sorting, recursive graph bisection or early termination.
   > 
   > But then I'd still expect some naive readahead logic to perform ok in such 
cases. For the extremely sparse case, it would fetch up to X times too many 
pages where X is the number of pages that get read ahead. For reasonable values 
of X, this should be ok.
   > 
   
   Agreed. I had a similar thought about readahead, but was trying to see if 
there are ways to avoid the factor of X. But as you said, in the general case 
it will probably end up fetching all the pages anyway.
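
   A rough sketch of the arithmetic behind this, under assumptions that are 
mine and not from the discussion: 4096-byte pages, a fixed 2 bytes per 
document, matches at every `stride`-th doc ID, and X read as the number of 
pages fetched per access (the needed page included). `PageTouchSketch` and its 
methods are hypothetical names.

```java
import java.util.BitSet;

// Hypothetical model: counts the distinct pages fetched when every stride-th
// document matches and each access pulls in pagesPerFetch consecutive pages.
final class PageTouchSketch {

    // Number of pages needed to hold numDocs entries of bytesPerDoc each.
    static int totalPages(int numDocs, int bytesPerDoc, int pageSize) {
        long bytes = (long) numDocs * bytesPerDoc;
        return (int) ((bytes + pageSize - 1) / pageSize);
    }

    // pagesPerFetch == 1 models exact prefetching; larger values model a
    // naive readahead window starting at the page that holds the match.
    static int pagesFetched(int numDocs, int stride, int bytesPerDoc,
                            int pageSize, int pagesPerFetch) {
        int total = totalPages(numDocs, bytesPerDoc, pageSize);
        BitSet fetched = new BitSet(total);
        for (long doc = 0; doc < numDocs; doc += stride) {
            int first = (int) (doc * bytesPerDoc / pageSize);
            int end = Math.min(total, first + pagesPerFetch); // exclusive
            fetched.set(first, end);
        }
        return fetched.cardinality();
    }

    public static void main(String[] args) {
        int docs = 1_000_000;
        System.out.println("total pages:      " + totalPages(docs, 2, 4096));
        System.out.println("0.1% exact:       " + pagesFetched(docs, 1_000, 2, 4096, 1));
        System.out.println("0.001% exact:     " + pagesFetched(docs, 100_000, 2, 4096, 1));
        System.out.println("0.001% readahead: " + pagesFetched(docs, 100_000, 2, 4096, 4));
    }
}
```

   With these numbers, a query matching 0.1% of docs touches 488 of the 489 
pages even with exact prefetching, while a query matching 0.001% of docs needs 
10 pages and fetches 40 with a 4-page window, i.e. X times the necessary pages.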
   
   For readahead in the DocValues case, I am thinking that when the doc value 
is fetched for the current doc, we could give the IndexInput a hint to perform 
readahead. This would be useful for IndexInputs that interact with a remote 
store. In the default case, however, I think the OS will take care of 
readahead on a read from a specific offset, so the hint could be a no-op 
there. The same could be done whenever a seek happens on an IndexInput that 
knows it should follow a sequential access pattern. So I guess your latest 
[PR](https://github.com/apache/lucene/pull/13450) is providing that hint and 
we probably don't need a separate `readAhead` API.
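
   As an illustration of the idea only, the sketch below shows a sequential 
access hint that is a no-op by default and triggers block readahead in a 
remote-backed input. `HintedInput`, `LocalInput`, `RemoteInput` and 
`hintSequential` are hypothetical names; this is not Lucene's `IndexInput` API 
and not what the linked PR implements.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical interface for illustration; not Lucene's IndexInput.
interface HintedInput {
    byte readByteAt(long offset);

    // Hint that reads following `offset` will be sequential. The default is a
    // no-op, on the assumption that the OS already reads ahead for local files.
    default void hintSequential(long offset) {}
}

// Local case: relies on the default no-op hint.
final class LocalInput implements HintedInput {
    private final byte[] data;

    LocalInput(byte[] data) {
        this.data = data;
    }

    @Override
    public byte readByteAt(long offset) {
        return data[(int) offset];
    }
}

// Remote-store case: the hint schedules the next few blocks for fetching.
final class RemoteInput implements HintedInput {
    private final byte[] data; // stands in for the remote object
    private final int blockSize;
    private final int blocksAhead;
    private final Set<Integer> requested = new LinkedHashSet<>();

    RemoteInput(byte[] data, int blockSize, int blocksAhead) {
        this.data = data;
        this.blockSize = blockSize;
        this.blocksAhead = blocksAhead;
    }

    @Override
    public byte readByteAt(long offset) {
        return data[(int) offset];
    }

    @Override
    public void hintSequential(long offset) {
        int current = (int) (offset / blockSize);
        int numBlocks = (data.length + blockSize - 1) / blockSize;
        for (int b = current + 1; b <= current + blocksAhead && b < numBlocks; b++) {
            requested.add(b); // a real implementation would start an async fetch here
        }
    }

    // Blocks requested so far, in request order and without duplicates.
    List<Integer> requestedBlocks() {
        return new ArrayList<>(requested);
    }
}
```

   A caller would invoke `hintSequential` when it fetches the value for the 
current doc or seeks; only the remote-backed input does any extra work.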


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

