Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

via GitHub Wed, 20 Nov 2024 00:02:59 -0800


vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2487597269


   _...contd. from above – thoughts on supporting independent multi-vectors 
specified via `NONE` multi-vector aggregation..._
   __
   
   The `Knn{Float|Byte}Vector` fields will accept multiple vector values for 
documents. Each vector value will be uniquely identifiable by a nodeId. Vectors 
for a doc will be stored adjacent to each other in flat storage. 
KnnVectorValues will support APIs for 1) getting docId for a given nodeId 
(existing), 2) getting vector value for a specific nodeId (existing), 3) 
getting all vector values for the document corresponding to a nodeId (new).
   
   Our codec today has a single, unique, sequentially increasing vector 
ordinal per doc, which we can store with the DirectMonotonicWriter and fetch 
with the matching DirectMonotonicReader. For multi-vectors, we need to handle 
multiple nodeIds mapping to a single document.
   
   I'm thinking of using "ordinals" and "sub-ordinals" to identify each vector 
value. The 'ordinal' is incremented when the docId changes. 'Sub-ordinals' 
start at 0 for each new doc and are incremented for each subsequent vector 
value in the doc. A nodeId in the graph is a "long", with the ordinal packed 
into the most-significant 32 bits and the sub-ordinal into the 
least-significant 32 bits.
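
   To make the bit layout concrete, here is a minimal sketch of the packing 
described above. The class and method names (`NodeIds`, `pack`, `ordinal`, 
`subOrdinal`) are hypothetical and not part of the PR; it only assumes the 
32/32 split mentioned in this comment.

```java
/** Hypothetical helper (not from the PR): packs (ordinal, subOrdinal) into a long nodeId. */
final class NodeIds {
  private NodeIds() {}

  /** Ordinal goes into the upper 32 bits, sub-ordinal into the lower 32 bits. */
  static long pack(int ordinal, int subOrdinal) {
    // Mask the sub-ordinal so sign extension cannot leak into the upper half.
    return ((long) ordinal << 32) | (subOrdinal & 0xFFFFFFFFL);
  }

  /** Recovers the ordinal from the most-significant 32 bits. */
  static int ordinal(long nodeId) {
    return (int) (nodeId >>> 32);
  }

  /** Recovers the sub-ordinal from the least-significant 32 bits. */
  static int subOrdinal(long nodeId) {
    return (int) nodeId;
  }

  public static void main(String[] args) {
    long nodeId = pack(7, 3);
    System.out.println(ordinal(nodeId) + " / " + subOrdinal(nodeId)); // prints "7 / 3"
  }
}
```

   One consequence of this layout is that, for non-negative ordinals and 
sub-ordinals, comparing nodeIds as longs orders them by document first and by 
position within the document second.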
   
   For flat storage, we can continue to use the technique in this PR; i.e. have 
one DirectMonotonicWriter object for docIds indexed by "ordinals", and another 
that stores start offsets for each docId, again indexed by ordinals. The 
sub-ordinal bits help us seek to exact vector values from this metadata.
   
   ```java
   int ordToDoc(long nodeId) {
     // get int ordinal from most-significant 32 bits
     // get docId for the ordinal from DirectMonotonicWriter
   }
   
   float[] vectorValue(long nodeId) {
     // get int ordinal from most-significant 32 bits
     // get "startOffset" for ordinal
     // get subOrdinal from least-significant 32 bits
     // read vector value from startOffset + (subOrdinal * dimension * byteSize)
   }
   
   float[] getAllVectorValues(long nodeId) {
     // get int ordinal from most-significant 32 bits
     // get "startOffset" for ordinal
     // get "endOffset" from offset value for ordinal + 1
     // return values from [startOffset, endOffset)
   }
   ```
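
   As a rough illustration of how these lookups could fit together, the sketch 
below uses plain in-memory arrays in place of the DirectMonotonic-encoded 
metadata and the on-disk vector data. Everything here is an assumption for 
illustration: the class name `FlatMultiVectorStore`, offsets counted in floats 
rather than bytes, a trailing end offset in `startOffsets`, and 
`getAllVectorValues` returning one array per vector.

```java
import java.util.Arrays;

/** Hypothetical in-memory stand-in for the flat storage; not the codec implementation. */
final class FlatMultiVectorStore {
  private final int dimension;
  private final int[] docIds;       // docId per ordinal
  private final int[] startOffsets; // start offset per ordinal, plus one trailing end offset
  private final float[] flat;       // all vectors, stored adjacently per document

  FlatMultiVectorStore(int dimension, int[] docIds, int[] startOffsets, float[] flat) {
    this.dimension = dimension;
    this.docIds = docIds;
    this.startOffsets = startOffsets;
    this.flat = flat;
  }

  int ordToDoc(long nodeId) {
    int ordinal = (int) (nodeId >>> 32); // most-significant 32 bits
    return docIds[ordinal];
  }

  float[] vectorValue(long nodeId) {
    int ordinal = (int) (nodeId >>> 32); // most-significant 32 bits
    int subOrdinal = (int) nodeId;       // least-significant 32 bits
    int from = startOffsets[ordinal] + subOrdinal * dimension;
    return Arrays.copyOfRange(flat, from, from + dimension);
  }

  float[][] getAllVectorValues(long nodeId) {
    int ordinal = (int) (nodeId >>> 32);
    int start = startOffsets[ordinal];
    int end = startOffsets[ordinal + 1]; // start of the next document's vectors
    int count = (end - start) / dimension;
    float[][] values = new float[count][];
    for (int i = 0; i < count; i++) {
      int from = start + i * dimension;
      values[i] = Arrays.copyOfRange(flat, from, from + dimension);
    }
    return values;
  }
}
```

   For example, with `dimension = 2`, `docIds = {10, 42}` and 
`startOffsets = {0, 4, 6}`, the document at ordinal 0 owns two vectors and the 
document at ordinal 1 owns one.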
   
   With this setup, we won't need parent-block join queries for multiple vector 
values. And we can use `getAllVectorValues()` for scoring with max or avg of 
all vectors in the doc at query time.
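
   A minimal sketch of what such query-time aggregation might look like, 
assuming dot product as the similarity and a `float[][]` holding all vectors 
of one document. The class `MultiVectorScorer` and its methods are 
hypothetical names for illustration, not an existing Lucene API.

```java
/** Hypothetical query-time aggregation over all vectors of one document. */
final class MultiVectorScorer {
  private MultiVectorScorer() {}

  /** Dot product; assumes both vectors have the same dimension. */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Score of the best-matching vector in the document. */
  static float maxScore(float[] query, float[][] docVectors) {
    if (docVectors.length == 0) {
      throw new IllegalArgumentException("document has no vectors");
    }
    float best = Float.NEGATIVE_INFINITY;
    for (float[] v : docVectors) {
      best = Math.max(best, dot(query, v));
    }
    return best;
  }

  /** Mean score across all vectors in the document. */
  static float avgScore(float[] query, float[][] docVectors) {
    if (docVectors.length == 0) {
      throw new IllegalArgumentException("document has no vectors");
    }
    float total = 0f;
    for (float[] v : docVectors) {
      total += dot(query, v);
    }
    return total / docVectors.length;
  }
}
```

   With query `(1, 0)` and document vectors `(1, 2)` and `(3, 4)`, the 
per-vector scores are 1 and 3, so the max is 3 and the average is 2.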
   
   I'm skeptical that this will give a visible performance boost. It should 
at least perform similarly to the block-join setup we have today, but 
hopefully be more convenient to use. And it sets us up for "dependent" 
multi-vector values like ColBERT.
   
   We'll need to code this up to iron out any wrinkles. I can work on a draft 
PR if the idea makes sense.
   __
   
   Note that this still doesn't allow >2B vector values. While the "long" 
nodeId can support it, our ANN impl. returns arrays containing all nodeIds in 
various places, and Java arrays are indexed by int, so their length cannot 
exceed Integer.MAX_VALUE (~2.1B). But we can address this limitation 
separately, perhaps with a different ANN algo for such high cardinality graphs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
