sohami commented on issue #13179: URL: https://github.com/apache/lucene/issues/13179#issuecomment-2151313696
> > Then before evaluating if these docs match the TwoPhaseIterator or not, we can perform prefetch on these buffered docs (via some prepareMatches mechanism on TwoPhaseIterator).
>
> This can be done, but I'd note that this would be a significant change to our APIs since `TwoPhaseIterator` only supports verifying the current document that the approximation is on. It is not possible to buffer matching documents from the approximation, to then check them with the `TwoPhaseIterator`. This is similar to the point I was making in a previous comment about buffering documents in collectors: `Scorer#score` only supports scoring the current document that the scorer is positioned on, so it is not possible to buffer several documents and then evaluate their scores in `TopScoreDocCollector` (without API changes).

Thanks for explaining this to me. It seems this would mean changing the iteration and scoring behavior to work on a range of docs rather than one doc at a time (which is the current behavior in Lucene). It would probably work fine for collectors that do not require any scoring, but that is not a general use case, and it would be limited to exact prefetch in collectors only.

> > via some prepareMatches mechanism on TwoPhaseIterator
>
> FWIW one thing that is on my mind is that both postings and doc values take in the order of 1 or 2 bytes per document. So even a query that matches 0.1% of docs, evenly distributed in the doc ID space, would still end up fetching all pages in practice. So a very smart prefetching may only perform better than naive prefetching in the following cases:
>
> * Queries that are _extremely_ sparse.
> * Queries whose matches are highly clustered in the doc ID space, because of index sorting, recursive graph bisection or early termination.
>
> But then I'd still expect some naive readahead logic to perform ok in such cases. For the extremely sparse case, it would fetch up to X times too many pages where X is the number of pages that get read ahead.
> For reasonable values of X, this should be ok.

Agreed. I had a similar thought about readahead, but was trying to see if there are ways to avoid X. But as you said, in the general case it will probably end up fetching all the pages anyway.

For readahead in the DocValues case, I am thinking that when the doc value is fetched for the current doc, we could provide a hint there to the `IndexInput` to perform readahead. This can be useful for `IndexInput` implementations that interact with a remote store. In the default case, however, I think the OS will take care of readahead on a read from a specific offset, so it could be a no-op there. The same could be done whenever a seek happens on an `IndexInput`, if it knows that it should follow a sequential access pattern. So I guess your latest [PR](https://github.com/apache/lucene/pull/13450) is providing that hint, and we probably don't need a separate `readAhead` API.
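
As a rough check of the estimate quoted above (1 or 2 bytes per document, 0.1% of docs matching, evenly distributed), here is a minimal back-of-the-envelope sketch. The 4096-byte page size and the class and method names are assumptions for illustration only; nothing here is a Lucene API.

```java
public class PageTouchEstimate {

    // Number of documents whose postings/doc-values data fits in one page,
    // assuming a fixed number of bytes per document.
    static long docsPerPage(int pageSizeBytes, int bytesPerDoc) {
        return pageSizeBytes / bytesPerDoc;
    }

    // Expected number of matching documents per page when matches are
    // evenly distributed in the doc ID space.
    static double expectedMatchesPerPage(int pageSizeBytes, int bytesPerDoc, double matchRatio) {
        return docsPerPage(pageSizeBytes, bytesPerDoc) * matchRatio;
    }

    public static void main(String[] args) {
        int pageSizeBytes = 4096;  // assumption: typical OS page size
        double matchRatio = 0.001; // 0.1% of docs match, as in the quoted comment

        for (int bytesPerDoc = 1; bytesPerDoc <= 2; bytesPerDoc++) {
            System.out.println(bytesPerDoc + " byte(s)/doc: "
                + docsPerPage(pageSizeBytes, bytesPerDoc) + " docs per page, "
                + expectedMatchesPerPage(pageSizeBytes, bytesPerDoc, matchRatio)
                + " expected matches per page");
        }
    }
}
```

Under these assumptions a page covers 2048 to 4096 documents, so it holds roughly 2 to 4 matches on average. With one match every 1000 docs, every page contains at least one match, which is why exact prefetching would end up fetching all pages anyway.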