vigyasharma opened a new pull request, #14729: URL: https://github.com/apache/lucene/pull/14729
Late interaction models, like [ColBERT](https://arxiv.org/abs/2004.12832) and [ColPali](https://arxiv.org/html/2407.01449v2), capture rich semantic interaction between documents and queries, and have been shown to outperform single-vector (no-interaction) models on search relevance. These models represent each query and each document as a multi-vector embedding, i.e. a set of vectors rather than a single vector.

One challenge with using late interaction models in search has been working with multi-vectors at scale. This change provides an efficient workaround by adding support for reranking the results of a query using late interaction multi-vectors. The envisioned use case is to run the full-corpus search as an ANN search on single-valued vectors, followed by a second pass that reranks those results using late interaction multi-vector scores.

This PR creates:

1. A LateInteractionField that stores multi-vectors in BinaryDocValues.
2. A DoubleValuesSource that scores query and document multi-vectors.
3. A FunctionScore query that wraps a provided query and reranks its results with late interaction model scores.

Note: This first approach does not add additional metadata to `FieldInfo`. As a result, we are unable to ensure that multi-vectors indexed in the same field have a consistent shape across documents.
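To make the two-pass flow described above concrete, here is a minimal, self-contained sketch in plain Java. It is not the code in this PR and does not use any Lucene API: the class name `LateInteractionSketch`, the byte layout in `encode`/`decode`, and the `rerank` helper are all hypothetical, chosen only for illustration. The scoring function is the sum-of-max-similarities (MaxSim) operator from the ColBERT paper, with dot product assumed as the per-vector similarity; the PR's actual similarity function may differ.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Illustrative sketch only: none of these names come from the PR or from Lucene.
final class LateInteractionSketch {

  // Pack a multi-vector into one byte[]: a 4-byte dimension header followed by
  // the float values of every vector. This layout is an assumption of the sketch.
  static byte[] encode(float[][] multiVector) {
    if (multiVector.length == 0) throw new IllegalArgumentException("empty multi-vector");
    int dim = multiVector[0].length;
    ByteBuffer buf = ByteBuffer
        .allocate(Integer.BYTES + multiVector.length * dim * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(dim);
    for (float[] v : multiVector) {
      if (v.length != dim) throw new IllegalArgumentException("ragged multi-vector");
      for (float x : v) buf.putFloat(x);
    }
    return buf.array();
  }

  // Inverse of encode(): read the dimension, then slice the floats into vectors.
  static float[][] decode(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    int dim = buf.getInt();
    int count = buf.remaining() / (dim * Float.BYTES);
    float[][] out = new float[count][dim];
    for (float[] v : out) {
      for (int i = 0; i < dim; i++) v[i] = buf.getFloat();
    }
    return out;
  }

  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  // ColBERT-style late interaction: for each query vector take its best match
  // among the document vectors, then sum those maxima.
  static float sumMaxSim(float[][] query, float[][] doc) {
    float total = 0f;
    for (float[] q : query) {
      float best = Float.NEGATIVE_INFINITY;
      for (float[] d : doc) best = Math.max(best, dot(q, d));
      total += best;
    }
    return total;
  }

  // Second pass: reorder first-pass candidates by late interaction score,
  // highest first. Ties keep their first-pass order (the object sort is stable).
  static int[] rerank(int[] candidateDocIds, byte[][] storedByDocId, float[][] query) {
    int n = candidateDocIds.length;
    float[] scores = new float[n];
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
      scores[i] = sumMaxSim(query, decode(storedByDocId[candidateDocIds[i]]));
      order[i] = i;
    }
    Arrays.sort(order, (a, b) -> Float.compare(scores[b], scores[a]));
    int[] out = new int[n];
    for (int i = 0; i < n; i++) out[i] = candidateDocIds[order[i]];
    return out;
  }

  public static void main(String[] args) {
    float[][] query = {{1f, 0f}, {0f, 1f}};
    byte[][] stored = {
      encode(new float[][] {{1f, 0f}}),           // doc 0 matches one query vector
      encode(new float[][] {{1f, 0f}, {0f, 1f}}), // doc 1 matches both
      encode(new float[][] {{0.5f, 0.5f}})        // doc 2 partially matches both
    };
    // Pretend {0, 1, 2} came back from a first-pass single-vector ANN search.
    System.out.println(Arrays.toString(rerank(new int[] {0, 1, 2}, stored, query)));
  }
}
```

The sketch decodes every candidate's bytes at scoring time, which mirrors the general idea of reading stored multi-vectors only for documents that survived the first pass. Because nothing enforces a shared dimension across documents, each encoded value carries its own dimension header here; that is one way to cope with the `FieldInfo` limitation noted above, not necessarily what the PR does.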