jpountz commented on code in PR #13364:
URL: https://github.com/apache/lucene/pull/13364#discussion_r1599846740
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsReader.java:
##########
@@ -2049,6 +2074,44 @@ public long cost() {
     }
   }
 
+  private void seekAndPrefetchPostings(IndexInput docIn, IntBlockTermState state)
+      throws IOException {
+    if (docIn.getFilePointer() != state.docStartFP) {
+      // Don't prefetch if the input is already positioned at the right offset, which suggests that
+      // the caller is streaming the entire inverted index (e.g. for merging), let the read-ahead
+      // logic do its work instead. Note that this heuristic doesn't work for terms that have skip
+      // data, since skip data is stored after the last term, but handling all terms that have <128
+      // docs is a good start already.
+      docIn.seek(state.docStartFP);
+      if (state.skipOffset < 0) {
+        // This postings list is very short as it doesn't have skip data, prefetch the page that
+        // holds the first byte of the postings list.
+        docIn.prefetch(1);
+      } else if (state.skipOffset <= MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH) {
+        // This postings list is short as it fits on a few pages, prefetch it all, plus one byte to
+        // make sure to include some skip data.
+        docIn.prefetch(state.skipOffset + 1);

Review Comment:
   This is trying to address your concern about the number of system calls: when the postings list is short, we do a single system call instead of issuing independent madvise calls for the postings and for the skip data. This matters especially for short postings lists, which are less likely to amortize the cost of system calls since iterating over all of their docs is fast CPU-wise.

   When a query has multiple clauses, the dense clauses usually consume only a small percentage of their docs, while the sparser clauses consume a large percentage of theirs, likely touching all of their pages. So faulting in all pages of short postings lists made sense to me, but I don't feel strongly about it and I'm happy to look at different approaches.

   > if the postings are short enough that we are willing to fault them all in at once, why do we even index skip data at all?

   You still get the CPU savings of skipping when the data fits in the page cache. Plus, skip data also records impacts, and there are many cases where having impacts on the sparser clauses is important in order to skip more hits on the denser clauses.
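   To make the tradeoff concrete, here is a minimal, hypothetical sketch contrasting the two strategies. It is not the actual Lucene99PostingsReader code: the PrefetchingInput interface, the method names, and the MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH value are made up for illustration, and the single-argument prefetch(length) only mirrors the call shape in the diff above (a hint to load bytes from the current file pointer).

   ```java
   import java.io.IOException;

   // Hypothetical stand-in for an IndexInput-like abstraction (illustration only).
   interface PrefetchingInput {
     void seek(long pos) throws IOException;

     // madvise-style hint: ask the OS to load `length` bytes starting at the current position.
     void prefetch(long length) throws IOException;
   }

   final class PrefetchStrategies {

     // Assumed threshold (bytes): postings lists up to this size get faulted in with one call.
     static final long MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH = 16 * 1024;

     // Two system calls: one hint for the first page of postings, one for the first page of
     // skip data (assumes skipOffset >= 0, i.e. the list has skip data).
     static void prefetchSeparately(PrefetchingInput in, long docStartFP, long skipOffset)
         throws IOException {
       in.seek(docStartFP);
       in.prefetch(1);
       in.seek(docStartFP + skipOffset);
       in.prefetch(1);
       in.seek(docStartFP); // restore position before decoding postings
     }

     // One system call for short lists: covers the whole postings list plus one byte of the
     // skip data, so both become resident with a single madvise-style hint.
     static void prefetchAtOnce(PrefetchingInput in, long docStartFP, long skipOffset)
         throws IOException {
       in.seek(docStartFP);
       if (skipOffset >= 0 && skipOffset <= MAX_POSTINGS_SIZE_FOR_FULL_PREFETCH) {
         in.prefetch(skipOffset + 1);
       } else {
         // Long list, or no skip data at all: only fault in the first page.
         in.prefetch(1);
       }
     }
   }
   ```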