jpountz commented on code in PR #14310:
URL: https://github.com/apache/lucene/pull/14310#discussion_r1974354077

##########
lucene/core/src/java/org/apache/lucene/search/package-info.java:
##########
@@ -350,6 +350,40 @@
  *
  * <a id="customQueriesExpert"></a>
  *
+ * <h3>Multi-stage retrieval pipelines</h3>
+ *
+ * <p>The above explains how to influence the score when evaluating all matches of the query. This
+ * is expensive by design since it applies to all matches of the query, which could be millions. In
+ * order to apply more sophisticated ranking logic, a good approach consists of having a retrieval
+ * pipeline that runs a simple candidate retrieval stage that retrieves e.g. 1,000 hits, followed by
+ * a more sophisticated reranking stage that reranks these 1,000 hits to select the best 100 hits
+ * among them. Since the number of hits that this retrieval stage needs to operate on is bounded, it
+ * allows it to be more sophisticated.
+ *
+ * <p>Lucene exposes reranking via the {@link org.apache.lucene.search.Rescorer} abstract class,
+ * which has two main sub-classes:
+ *
+ * <ul>
+ * <li>{@link org.apache.lucene.search.QueryRescorer}, to rescore using a query. For instance, the
+ * query string could be parsed as phrase query using {@link
+ * org.apache.lucene.util.QueryBuilder#createPhraseQuery} instead of a boolean query in order
+ * to help boost hits which also match the query string as a phrase.
+ * <li>{@link org.apache.lucene.search.SortRescorer}, to rescore using a {@link
+ * org.apache.lucene.search.Sort}. For instance, the best 1,000 hits by BM25 score may be
+ * sorted by descending popularity in order to compute the final top-100 hits.
+ * </ul>
+ *
+ * <h3>Top hits fusion</h3>
+ *
+ * <p>Sometimes, multiple retrieval pipelines may make sense, having their own pros and cons. A
+ * typical example would be a lexical retrieval pipeline, matching exactly what the user requested,
+ * and a semantic retrieval pipeline, matching documents that are closest to the user's query from a
+ * semantic perspective. Combining scores is hazardous as different retrieval pipelines often
+ * produce scores that not only have different ranges, but also different distributions within this
+ * range. A robust way of combining multiple retrieval pipelines consists of combining the top hits
+ * that they produce through their ranks rather than through their scores using reciprocal rank
+ * fusion. This is exposed via {@link org.apache.lucene.search.TopDocs#rrf(int, int, TopDocs[])}.

Review Comment:
   Ah, thanks, I didn't even know that one could do this. I just checked the generated javadocs, they look good with these parameter names.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org