Re: [I] Improve Lucene's I/O concurrency [lucene]

via GitHub Wed, 05 Jun 2024 01:57:19 -0700


jpountz commented on issue #13179:
URL: https://github.com/apache/lucene/issues/13179#issuecomment-2149257220


   > Then before evaluating if these docs matches TwoPhaseIterator or not, we 
can perform prefetch on these buffered docs (via some prepareMatches mechanism 
on TwoPhaseIterator).
   
   This can be done, but I'd note that this would be a significant change to 
our APIs since `TwoPhaseIterator` only supports verifying the current document 
that the approximation is on. It is not possible to buffer matching documents 
from the approximation, to then check them with the `TwoPhaseIterator`. This is 
similar to the point I was making in a previous comment about buffering 
documents in collectors: `Scorer#score` only supports scoring the current 
document that the scorer is positioned on, so it is not possible to buffer several 
documents and then evaluate their scores in `TopScoreDocCollector` (without API 
changes).
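
To make the constraint concrete, here is a minimal sketch. It does not use Lucene itself: `Approximation` and `TwoPhase` are hypothetical stand-ins that only mirror the shape of `DocIdSetIterator` and `TwoPhaseIterator` (a forward-only cursor, and a `matches()` that takes no doc ID argument). The `bufferThenVerify` helper is likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class TwoPhaseSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for an approximation: a forward-only cursor over candidates. */
    static final class Approximation {
        private final int[] docs;
        private int idx = -1;

        Approximation(int... docs) { this.docs = docs; }

        int docID() {
            if (idx < 0) return -1;
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }

        int nextDoc() { idx++; return docID(); }
    }

    /** Hypothetical stand-in for the two-phase check: matches() has no doc argument. */
    static final class TwoPhase {
        final Approximation approximation;
        private final IntPredicate verifier;

        TwoPhase(Approximation approximation, IntPredicate verifier) {
            this.approximation = approximation;
            this.verifier = verifier;
        }

        /** Verifies whatever document the approximation is positioned on right now. */
        boolean matches() { return verifier.test(approximation.docID()); }
    }

    /** The supported pattern: advance, verify immediately, repeat. */
    static List<Integer> collect(TwoPhase tp) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = tp.approximation.nextDoc(); doc != NO_MORE_DOCS;
             doc = tp.approximation.nextDoc()) {
            if (tp.matches()) hits.add(doc);
        }
        return hits;
    }

    /** The unsupported pattern: buffer n candidates, then try to verify them. */
    static boolean[] bufferThenVerify(TwoPhase tp, int n) {
        int[] buffered = new int[n];
        for (int i = 0; i < n; i++) buffered[i] = tp.approximation.nextDoc();
        boolean[] results = new boolean[n];
        // Every call below checks buffered[n - 1], never buffered[i]:
        // the cursor has already moved past the earlier candidates.
        for (int i = 0; i < n; i++) results[i] = tp.matches();
        return results;
    }
}
```

With candidates 2, 5, 8, 11 and a verifier that accepts odd doc IDs, `collect` finds 5 and 11, while `bufferThenVerify(tp, 3)` reports no match for any of the three buffered candidates because all three calls verify doc 8.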
   
   > via some prepareMatches mechanism on TwoPhaseIterator
   
   FWIW one thing that is on my mind is that both postings and doc values take 
on the order of 1 or 2 bytes per document. So even a query that matches 0.1% of 
docs, evenly distributed in the doc ID space, would still end up fetching all 
pages in practice. So very smart prefetching may only perform better than 
naive prefetching in the following cases:
    - Queries that are _extremely_ sparse.
    - Queries whose matches are highly clustered in the doc ID space, because 
of index sorting, recursive graph bisection or early termination.
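
The arithmetic behind the "all pages" claim can be checked with a rough sketch. It assumes 4096-byte pages and a fixed number of bytes per document (real postings and doc values are variable-length), with matches spaced exactly evenly. At 0.1% there is one match every 1,000 docs, so 1,000 to 2,000 bytes apart, which is less than one page. `PageTouchEstimate` and its method name are made up for illustration.

```java
public class PageTouchEstimate {
    /**
     * Fraction of pages that contain at least one match when matches are
     * evenly spaced every docsBetweenMatches documents, starting at doc 0.
     */
    static double fractionOfPagesTouched(long maxDoc, int bytesPerDoc,
                                         int docsBetweenMatches, int pageSize) {
        long totalBytes = maxDoc * bytesPerDoc;
        long totalPages = (totalBytes + pageSize - 1) / pageSize; // round up
        long touched = 0;
        long lastPage = -1;
        for (long doc = 0; doc < maxDoc; doc += docsBetweenMatches) {
            long page = (doc * bytesPerDoc) / pageSize;
            if (page != lastPage) { // matches are visited in increasing order
                touched++;
                lastPage = page;
            }
        }
        return (double) touched / totalPages;
    }

    public static void main(String[] args) {
        // 10M docs, 0.1% of docs match (every 1,000th doc), 2 bytes per doc.
        System.out.println(fractionOfPagesTouched(10_000_000L, 2, 1_000, 4096));
        // Extremely sparse: 1 match per 100,000 docs.
        System.out.println(fractionOfPagesTouched(10_000_000L, 2, 100_000, 4096));
    }
}
```

Under these assumptions the 0.1% case touches every page, and only the one-in-100,000 case skips the large majority of pages.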
   
   But then I'd still expect some naive readahead logic to perform ok in such 
cases. For the extremely sparse case, it would fetch up to X times too many 
pages where X is the number of pages that get read ahead. For reasonable values 
of X, this should be ok.
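
The "up to X times" bound can be illustrated with a small sketch. It assumes a simplified readahead model in which touching a needed page fetches a window of X consecutive pages starting at that page, and pages that were already fetched are not fetched again. This is an illustration only, not how any particular OS implements readahead, and `ReadaheadOverhead` is a made-up name.

```java
public class ReadaheadOverhead {
    /**
     * Number of distinct pages fetched when each needed page triggers a fetch
     * of the window [page, page + readahead - 1]. neededPages must be sorted
     * in increasing order.
     */
    static long pagesFetched(long[] neededPages, int readahead) {
        long fetched = 0;
        long fetchedUpTo = -1; // highest page number fetched so far
        for (long page : neededPages) {
            long start = Math.max(page, fetchedUpTo + 1);
            long end = page + readahead - 1;
            if (end >= start) {
                fetched += end - start + 1;
                fetchedUpTo = end;
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        // Far-apart pages: windows never overlap, so the overhead is exactly X.
        System.out.println(pagesFetched(new long[] {0, 100, 200}, 4));
        // Adjacent pages: windows overlap, so the overhead is much lower.
        System.out.println(pagesFetched(new long[] {0, 1, 2, 3}, 4));
    }
}
```

In this model the worst case is the extremely sparse one, where windows never overlap and the fetched page count is X times the needed page count; as the needed pages get closer together, the ratio drops toward 1.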
   
   The other thing that is on my mind is that this sort of approach allows us 
to do it completely at the OS level, which gives additional efficiency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


