jpountz commented on code in PR #14310:
URL: https://github.com/apache/lucene/pull/14310#discussion_r1974354077

##########
lucene/core/src/java/org/apache/lucene/search/package-info.java:
##########
@@ -350,6 +350,40 @@
  *
  * <a id="customQueriesExpert"></a>
  *
+ * <h3>Multi-stage retrieval pipelines</h3>
+ *
+ * <p>The above explains how to influence the score when evaluating all matches of the query. This
+ * is expensive by design since it applies to all matches of the query, which could be millions. In
+ * order to apply more sophisticated ranking logic, a good approach consists of having a retrieval
+ * pipeline that runs a simple candidate retrieval stage that retrieves e.g. 1,000 hits, followed by
+ * a more sophisticated reranking stage that reranks these 1,000 hits to select the best 100 hits
+ * among them. Since the number of hits that this retrieval stage needs to operate on is bounded, it
+ * allows it to be more sophisticated.
+ *
+ * <p>Lucene exposes reranking via the {@link org.apache.lucene.search.Rescorer} abstract class,
+ * which has two main sub-classes:
+ *
+ * <ul>
+ *   <li>{@link org.apache.lucene.search.QueryRescorer}, to rescore using a query. For instance, the
+ *       query string could be parsed as phrase query using {@link
+ *       org.apache.lucene.util.QueryBuilder#createPhraseQuery} instead of a boolean query in order
+ *       to help boost hits which also match the query string as a phrase.
+ *   <li>{@link org.apache.lucene.search.SortRescorer}, to rescore using a {@link
+ *       org.apache.lucene.search.Sort}. For instance, the best 1,000 hits by BM25 score may be
+ *       sorted by descending popularity in order to compute the final top-100 hits.
+ * </ul>
+ *
+ * <h3>Top hits fusion</h3>
+ *
+ * <p>Sometimes, multiple retrieval pipelines may make sense, having their own pros and cons. A
+ * typical example would be a lexical retrieval pipeline, matching exactly what the user requested,
+ * and a semantic retrieval pipeline, matching documents that are closest to the user's query from a
+ * semantic perspective. Combining scores is hazardous as different retrieval pipelines often
+ * produce scores that not only have different ranges, but also different distributions within this
+ * range. A robust way of combining multiple retrieval pipelines consists of combining the top hits
+ * that they produce through their ranks rather than through their scores using reciprocal rank
+ * fusion. This is exposed via {@link org.apache.lucene.search.TopDocs#rrf(int, int, TopDocs[])}.

Review Comment:
   Ah, thanks, I didn't even know that one could do this. I just checked the 
generated javadocs, they look good with these parameter names.
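
   For readers of this thread who are unfamiliar with reciprocal rank fusion, here is a minimal, library-independent sketch of the general technique described in the quoted Javadoc. It is not Lucene's implementation of `TopDocs#rrf`. The class name `RrfSketch`, the method `fuse`, and its parameter names `topN`, `k` and `rankings` are assumptions made for illustration; the parameter order only loosely mirrors the `(int, int, TopDocs[])` shape in the link above. Each document receives the sum of `1 / (k + rank)` over every ranking that contains it, with ranks starting at 1, so only positions matter and the raw scores of each pipeline are never compared.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of reciprocal rank fusion; names here are hypothetical, not Lucene API. */
public class RrfSketch {

  /**
   * Fuses several ranked lists of doc IDs (best hit first) into one list.
   *
   * @param topN maximum number of fused hits to return
   * @param k smoothing constant; larger values flatten the advantage of top ranks
   * @param rankings one ranked list of doc IDs per retrieval pipeline
   * @return up to topN (docID, fused score) entries, best first
   */
  public static List<Map.Entry<Integer, Double>> fuse(
      int topN, int k, List<List<Integer>> rankings) {
    Map<Integer, Double> scores = new HashMap<>();
    for (List<Integer> ranking : rankings) {
      for (int i = 0; i < ranking.size(); i++) {
        int rank = i + 1; // ranks are 1-based
        // Accumulate this ranking's contribution for the document.
        scores.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
      }
    }
    List<Map.Entry<Integer, Double>> fused = new ArrayList<>(scores.entrySet());
    // Highest fused score first; ties broken by doc ID so the order is deterministic.
    fused.sort(
        (a, b) -> {
          int cmp = Double.compare(b.getValue(), a.getValue());
          return cmp != 0 ? cmp : Integer.compare(a.getKey(), b.getKey());
        });
    return new ArrayList<>(fused.subList(0, Math.min(topN, fused.size())));
  }
}
```

   As a usage example, fusing a lexical ranking `[3, 7, 1]` with a semantic ranking `[7, 9, 3]` using `k = 60` and `topN = 3` yields the doc IDs 7, 3, 9 in that order: documents 7 and 3 appear in both lists, and 7 wins because its ranks (2 and 1) beat those of 3 (1 and 3).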



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

