vigyasharma opened a new pull request, #14173:
URL: https://github.com/apache/lucene/pull/14173

   Another take at #12313 
   
   This PR adds support for _independent_ multi-vectors, i.e. scenarios where 
a single document is represented by multiple independent vector values. The 
most common example is the passage-vector search use case, where we create a 
vector for every paragraph chunk in the document. 
   
   Currently, Lucene only supports a single vector per document. Users are 
required to create parent-child relationships with the _chunked_ vectors of a 
document, and call a `ParentBlockJoin` query to run passage vector search. This 
change allows indexing multiple vectors within the same document. 
   
   Each vector is still assigned a unique int ordinal, but multiple ordinals 
can now map to the same document. We use additional metadata to maintain the 
many-to-one ordToDoc mapping, and to quickly find the first indexed vector 
ordinal for a document (the base ordinal, `baseOrd`). This gives us new APIs 
that fetch all vectors for a document, which can be used for faster scoring (as 
opposed to the child doc query in the ParentJoin approach):
   ```java
   // iterator on vector values for the doc corresponding to provided ord
   public Iterator<float[]> allVectorValues(int ord);
   
   // simpler API, returns iterator on all vectors for doc corresponding to
   // given base ord.
   public Iterator<float[]> allVectorValues(int baseOrd, int ordCount); 
   
   // ... same APIs for ByteVectorValues
   ```
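
   To make the ordinal bookkeeping concrete, here is a minimal in-memory sketch.
   It is not the code in this PR: the class `MultiVectorOrdinals` and its
   `addDocument` method are hypothetical, and it assumes that all vectors of a
   document receive contiguous ordinals starting at the document's `baseOrd`.
   ```java
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.Iterator;
   import java.util.List;
   import java.util.Map;

   /** Hypothetical sketch of many-to-one ord -> doc metadata (not the PR's implementation). */
   final class MultiVectorOrdinals {
     private final List<float[]> vectors = new ArrayList<>();          // indexed by ordinal
     private final List<Integer> docIds = new ArrayList<>();           // ordinal -> docId
     private final Map<Integer, Integer> docToBaseOrd = new HashMap<>();
     private final Map<Integer, Integer> docToOrdCount = new HashMap<>();

     /** Appends every vector of one document; assumes contiguous ordinals per document. */
     void addDocument(int docId, List<float[]> docVectors) {
       if (docVectors.isEmpty() || docToBaseOrd.containsKey(docId)) {
         throw new IllegalArgumentException("doc must be new and have at least one vector");
       }
       docToBaseOrd.put(docId, vectors.size());   // first ordinal assigned to this doc
       docToOrdCount.put(docId, docVectors.size());
       for (float[] v : docVectors) {
         docIds.add(docId);
         vectors.add(v);
       }
     }

     /** Many-to-one lookup: several ordinals can return the same docId. */
     int ordToDoc(int ord) {
       return docIds.get(ord);
     }

     /** First ordinal of the document that owns {@code ord}. */
     int baseOrd(int ord) {
       return docToBaseOrd.get(ordToDoc(ord));
     }

     /** Number of vectors of the document that owns {@code ord}. */
     int ordCount(int ord) {
       return docToOrdCount.get(ordToDoc(ord));
     }

     /** All vectors of the document that owns {@code ord}. */
     Iterator<float[]> allVectorValues(int ord) {
       return allVectorValues(baseOrd(ord), ordCount(ord));
     }

     /** All vectors in the ordinal range [baseOrd, baseOrd + ordCount). */
     Iterator<float[]> allVectorValues(int baseOrd, int ordCount) {
       return vectors.subList(baseOrd, baseOrd + ordCount).iterator();
     }
   }
   ```
   With this layout, any ordinal hit during graph search can be resolved to its
   document and then to the document's full vector range without a separate
   child-document query.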
   
   #### Interface
   The interface to use multi-vector values is quite simple now:
   ```java
   // indexing
   Document doc = new Document();
   doc.add(vector1);
   doc.add(vector2);
   ...
   doc.add(vectorN);
   iw.addDocument(doc);
   
   // query
   KnnFloatMultiVectorQuery query = new KnnFloatMultiVectorQuery(field, target, k);
   searcher.search(query, k);
   ```
   
   I was able to add a multi-vector benchmark to luceneutil to run this setup 
end to end. Will link results and a luceneutil PR in comments.
   
   
   #### Pending Tasks:
   This is an early draft to get some feedback; I have TODOs across the code 
for future improvements. Here are some big items pending:
   
   - [ ] Backward compatibility for the storage format
   - [ ] New version for vector storage format (Lucene 111)?
   - [ ] Support for merging multi-vector values
   - [ ] Optimization for single-valued vectors (store less metadata)
   - [ ] Support for scoring based on all vectors of a document (e.g. 
`ScoreMode.Avg`)
   - [ ] Unit tests
   - [ ] Support for multi-valued vectors in quantized vectors.
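
   For the pending item on scoring across all vectors of a document, one way to
   picture it is as an aggregation over per-vector similarities. The sketch below
   is only an illustration under stated assumptions: `MultiVectorScorer` and its
   `Aggregation` enum are hypothetical names, dot product stands in for the
   configured similarity function, and the input iterator is whatever
   `allVectorValues` would return for the document.
   ```java
   import java.util.Iterator;

   /** Hypothetical sketch of document-level scoring over all of a document's vectors. */
   final class MultiVectorScorer {
     /** Hypothetical aggregation modes; AVG mirrors the `ScoreMode.Avg` idea above. */
     enum Aggregation { MAX, AVG }

     /** Plain dot product, used here as a stand-in similarity function. */
     static float dotProduct(float[] a, float[] b) {
       if (a.length != b.length) {
         throw new IllegalArgumentException("dimension mismatch");
       }
       float sum = 0f;
       for (int i = 0; i < a.length; i++) {
         sum += a[i] * b[i];
       }
       return sum;
     }

     /** Combines the query's similarity to each document vector into one score. */
     static float score(float[] query, Iterator<float[]> docVectors, Aggregation mode) {
       if (!docVectors.hasNext()) {
         throw new IllegalArgumentException("document has no vectors");
       }
       float max = Float.NEGATIVE_INFINITY;
       float total = 0f;
       int count = 0;
       while (docVectors.hasNext()) {
         float s = dotProduct(query, docVectors.next());
         max = Math.max(max, s);
         total += s;
         count++;
       }
       return mode == Aggregation.MAX ? max : total / count;
     }
   }
   ```
   MAX rewards a document for its single best-matching passage, while AVG rewards
   documents whose passages match the query consistently.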
   
   **Note:** This change does not include _dependent_ multi-valued vectors like 
ColBERT, where the multiple vectors must be used together to compute similarity. 
It does, however, lay essential groundwork that can subsequently be extended 
for this support.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

