vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2487597269
_...contd. from above – thoughts on supporting independent multi-vectors specified via `NONE` multi-vector aggregation..._

The `Knn{Float|Byte}Vector` fields will accept multiple vector values per document. Each vector value will be uniquely identifiable by a nodeId. Vectors for a doc will be stored adjacent to each other in flat storage. `KnnVectorValues` will support APIs for:

1. getting the docId for a given nodeId (existing),
2. getting the vector value for a specific nodeId (existing),
3. getting all vector values for the document corresponding to a nodeId (new).

Our codec today has a single, unique, sequentially increasing vector ordinal per doc, which we can store and fetch with the DirectMonotonicWriter. For multi-vectors, we need to handle multiple nodeIds mapping to a single document.

I'm thinking of using "ordinals" and "sub-ordinals" to identify each vector value. The "ordinal" is incremented when the docId changes. "Sub-ordinals" start at 0 for each new doc and are incremented for each subsequent vector value in that doc. A nodeId in the graph is a "long" with the ordinal packed into the most-significant 32 bits and the sub-ordinal into the least-significant 32 bits.

For flat storage, we can continue to use the technique in this PR, i.e. have one DirectMonotonicWriter object for docIds indexed by "ordinals", and another that stores start offsets for each docId, again indexed by ordinals. The sub-ordinal bits help us seek to the exact vector value from this metadata.
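To make the bit layout concrete, here is a minimal sketch of the ordinal/sub-ordinal packing and the offset arithmetic described above. It is only an illustration under stated assumptions: the class `MultiVectorNodeIds` and its methods are hypothetical names, not Lucene APIs, and plain Java arrays stand in for the DirectMonotonicWriter-backed metadata. Start offsets are assumed to be byte offsets.

```java
// Hypothetical helper, not part of Lucene: illustrates the proposed nodeId layout only.
final class MultiVectorNodeIds {
  private MultiVectorNodeIds() {}

  /** Packs the ordinal into the upper 32 bits and the sub-ordinal into the lower 32 bits. */
  static long pack(int ordinal, int subOrdinal) {
    return ((long) ordinal << 32) | (subOrdinal & 0xFFFFFFFFL);
  }

  /** Extracts the ordinal (most-significant 32 bits). */
  static int ordinal(long nodeId) {
    return (int) (nodeId >>> 32);
  }

  /** Extracts the sub-ordinal (least-significant 32 bits). */
  static int subOrdinal(long nodeId) {
    return (int) nodeId;
  }

  /** Looks up the docId; ordToDocId is an in-memory stand-in indexed by ordinal. */
  static int ordToDoc(long nodeId, int[] ordToDocId) {
    return ordToDocId[ordinal(nodeId)];
  }

  /**
   * Byte position of one vector in flat storage. startOffsets is an in-memory
   * stand-in indexed by ordinal, holding the byte offset of each doc's first vector.
   */
  static long vectorOffset(long nodeId, long[] startOffsets, int dimension, int bytesPerComponent) {
    long start = startOffsets[ordinal(nodeId)];
    return start + (long) subOrdinal(nodeId) * dimension * bytesPerComponent;
  }

  public static void main(String[] args) {
    // Two docs (docIds 4 and 9) with 2 and 3 float vectors of dimension 4.
    int[] ordToDocId = {4, 9};
    long[] startOffsets = {0L, 32L, 80L}; // last entry marks the end of the final doc

    long nodeId = pack(1, 2); // third vector of the second doc
    System.out.println("docId  = " + ordToDoc(nodeId, ordToDocId));
    System.out.println("offset = " + vectorOffset(nodeId, startOffsets, 4, Float.BYTES));
  }
}
```

Keeping one extra entry at the end of the offsets array lets the end offset for an ordinal be read from `ordinal + 1`, including for the last doc.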
```java
int ordToDoc(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // get docId for the ordinal from DirectMonotonicWriter
}

float[] vectorValue(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // get "startOffset" for ordinal
  // get subOrdinal from least-significant 32 bits
  // read vector value from startOffset + (subOrdinal * dimension * byteSize)
}

float[] getAllVectorValues(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // get "startOffset" for ordinal
  // get "endOffset" from offset value for ordinal + 1
  // return values from [startOffset, endOffset)
}
```

With this setup, we won't need parent-block join queries for multiple vector values. And we can use `getAllVectorValues()` for scoring with the max or avg of all vectors in the doc at query time.

I'm skeptical that this will give a visible performance boost. It should at least be similar to the block-join setup we have today, but hopefully more convenient to use. And it sets us up for "dependent" multi-vector values like ColBERT.

We'll need to code this up to iron out any wrinkles. I can work on a draft PR if the idea makes sense.

Note that this still doesn't allow >2B vector values. While the "long" nodeId can support it, our ANN impl. returns arrays containing all nodeIds in various places. I don't think Java can support >2B array length. But we can address this limitation separately, perhaps with a different ANN algo for such high-cardinality graphs.