Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

via GitHub Sat, 01 Feb 2025 15:56:52 -0800


vigyasharma commented on PR #14173:
URL: https://github.com/apache/lucene/pull/14173#issuecomment-2629167044


   > I think this PR is still doing globally unique ordinals for vectors? So, 
vectors 1, 2, 3 go to document 1 and ordinals 4, 5 go to doc 2? If so, I think 
we should "bite the bullet" and make vector ordinals long values. 
   
   That is correct. I did play around with "long" nodeIDs in an alternate 
implementation but ran into a whole bunch of challenges:
   1. In a lot of places, we hold all nodes of the graph (or at a level) in a 
single array, and use vector ordinals as array index addresses. Java limits the 
size of arrays (and lists) to 'int max' and does not allow 'long' array 
indices. These will need to be changed to use a different data structure.
   2. We use bitsets on vector ordinals for multiple use cases, and almost all 
bitsets work with integers. There is one `LongBitSet` but it doesn't integrate 
well with BitSetIterators etc.
   
   At one point, I had this idea of representing vectors using `ordinals` and 
`subOrdinals`, where ordinals correspond to documents, i.e. one ordinal per 
document, and subOrdinals to the multiple vector values per document. For 
example, for a multi-vector with 3 values, the (ord, subOrd) values would be 
(ord, 0), (ord, 1), (ord, 2). For single valued vectors, we will only have 
subOrdinal=0 for each vector. 
   
   I would then pack the ordinal and subOrdinal values into a single `long` 
graph nodeId. Ordinal took the lower 32 bits, and subOrdinal took 32 MSB. Since 
we use VLong to write them, we don't need extra storage for single valued 
vectors where subOrdinal bits were all zero. This was also neat because I could 
score `>INT_MAX` vectors in the graph, and use bit masks to find the 
baseOrdinal (first vector of the doc) from any graph nodeId. 
   
   However, in addition to the "long" nodeId challenges above, this approach 
broke latent assumptions like "vector ordinals are densely packed".  For 
example, we could no longer assume that maxOrdinal is the graph size. Hence I 
pivoted back to "int ordinals" and started maintaining `ordToDoc` maps with 
many-one mapping.
   Given the volume of changes with "long" graph nodeIds, I wonder if we should 
do it when we add a new ANN implementation (like DiskANN maybe?).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] [Draft] Support Multi-Vector HNSW Search via Flat Vector Storage [lucene]

Reply via email to