vigyasharma opened a new pull request, #14173: URL: https://github.com/apache/lucene/pull/14173
Another take at #12313

The following PR adds support for _independent_ multi-vectors, i.e. scenarios where a single document is represented by multiple independent vector values. The most common example is the passage-vector search use case, where we create a vector for every paragraph chunk in the document.

Currently, Lucene only supports a single vector per document. Users are required to create parent-child relationships with the _chunked_ vectors of a document, and call a `ParentBlockJoin` query to run passage vector search. This change allows indexing multiple vectors within the same document.

Each vector is still assigned a unique int ordinal, but multiple ordinals can now map to the same document. We use additional metadata to maintain the many-to-one ordToDoc mapping, and to quickly find the first indexed vector ordinal for a document (called the base ordinal, `baseOrd`). This gives us new APIs that fetch all vectors for a document, which can be used for faster scoring (as opposed to the child-doc query in the ParentJoin approach):

```java
// Iterator over all vector values of the document that the provided ord belongs to.
public Iterator<float[]> allVectorValues(int ord);

// Simpler API: iterator over all vectors of the document identified by the given base ord.
public Iterator<float[]> allVectorValues(int baseOrd, int ordCount);

// ... same APIs for ByteVectorValues
```

#### Interface

The interface to use multi-vector values is quite simple now:

```java
// indexing
Document doc = new Document();
doc.add(vector1);
doc.add(vector2);
...
doc.add(vectorN);
iw.addDocument(doc);

// query
KnnFloatMultiVectorQuery query = new KnnFloatMultiVectorQuery(field, target, k);
searcher.search(query, k);
```

I was able to add a multi-vector benchmark to luceneutil to run this setup end to end. Will link results and a luceneutil PR in comments.

#### Pending Tasks:

This is an early draft to get some feedback; I have TODOs across the code for future improvements.
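For readers who want a concrete picture of the ordinal bookkeeping described in the summary, here is a minimal, self-contained sketch. It is not the code in this PR: the class name `MultiVectorOrdMapSketch`, the helpers `docOf`, `baseOrdOf` and `ordCountOf`, and the in-memory prefix-sum layout are all assumptions made for illustration. Only the two `allVectorValues` signatures mirror the APIs listed in the summary.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/**
 * Hypothetical in-memory model of a many-to-one ordinal-to-document mapping.
 * Documents are assumed to be numbered densely from 0; this is NOT the PR's storage format.
 */
final class MultiVectorOrdMapSketch {
  private final List<float[]> vectors; // vector at position i has ordinal i
  private final int[] ordToDoc;        // many-to-one: ordinal -> doc id
  private final int[] docBaseOrd;      // docBaseOrd[d] = first ordinal of doc d (prefix sums)

  MultiVectorOrdMapSketch(List<List<float[]>> vectorsByDoc) {
    int numDocs = vectorsByDoc.size();
    docBaseOrd = new int[numDocs + 1];
    for (int doc = 0; doc < numDocs; doc++) {
      docBaseOrd[doc + 1] = docBaseOrd[doc] + vectorsByDoc.get(doc).size();
    }
    ordToDoc = new int[docBaseOrd[numDocs]];
    vectors = new ArrayList<>(ordToDoc.length);
    for (int doc = 0; doc < numDocs; doc++) {
      for (float[] v : vectorsByDoc.get(doc)) {
        ordToDoc[vectors.size()] = doc; // next free ordinal belongs to this doc
        vectors.add(v);
      }
    }
  }

  /** Document that owns the given ordinal. */
  int docOf(int ord) {
    return ordToDoc[ord];
  }

  /** First ordinal (baseOrd) of the document that owns the given ordinal. */
  int baseOrdOf(int ord) {
    return docBaseOrd[ordToDoc[ord]];
  }

  /** Number of vectors indexed for the document that owns the given ordinal. */
  int ordCountOf(int ord) {
    int doc = ordToDoc[ord];
    return docBaseOrd[doc + 1] - docBaseOrd[doc];
  }

  /** All vectors of the document that the given ordinal belongs to. */
  Iterator<float[]> allVectorValues(int ord) {
    return allVectorValues(baseOrdOf(ord), ordCountOf(ord));
  }

  /** All vectors in the ordinal range [baseOrd, baseOrd + ordCount). */
  Iterator<float[]> allVectorValues(int baseOrd, int ordCount) {
    return vectors.subList(baseOrd, baseOrd + ordCount).iterator();
  }
}
```

In this sketch, three documents holding 2, 1 and 3 vectors receive ordinals 0-1, 2 and 3-5, so any ordinal of the last document resolves to base ordinal 3 with a count of 3. Keeping prefix sums per document is just one possible way to make the base-ordinal lookup constant time.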
Here are some big items pending:

- [ ] Backward compatibility for the storage format
- [ ] New version for vector storage format (Lucene 111)?
- [ ] Support for merging on multi vector values
- [ ] Optimization for single-valued vectors (store less metadata)
- [ ] Support for scoring based on all vectors of a document (e.g. `ScoreMode.Avg`)
- [ ] Unit tests
- [ ] Support for multi-valued vectors in quantized vectors.

**Note:** This change does not include _dependent_ multi-valued vectors like ColBERT, where the multiple vectors must be used together to compute similarity. It does, however, lay essential groundwork that can subsequently be extended for this support.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@lucene.apache.org