[PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

via GitHub Tue, 13 Aug 2024 12:02:51 -0700


benwtrent opened a new pull request, #13651:
URL: https://github.com/apache/lucene/pull/13651


   # Not only a draft, but a very rough one indeed
   
   Not opening for the sake of review, but just openness and for those curious 
about the work.
   
   # High-level design
   
   RaBitQ is basically a better binary quantization, and it works across all 
models we have tested against. Like PQ, it requires coarse-grained 
clustering to be effective at higher vector densities (effective being defined 
as requiring 5x or lower oversampling for recall > 95%). But in our testing, 
a single cluster can hold an exceptionally large number of vectors (10s to 
100s of millions).
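
As a rough illustration of what "binary quantization relative to a centroid" means, here is a minimal Java sketch. It keeps only the sign of each centered component, packed one bit per dimension. This is a simplification and not the RaBitQ algorithm or the code in this PR; the class and method names (`BinaryQuantizerSketch`, `quantize`, `distanceToCentroid`) are hypothetical.

```java
// Simplified sign-based binary quantization around a centroid.
// Assumption: this is an illustrative sketch, not the PR's implementation.
final class BinaryQuantizerSketch {

  /** Packs one bit per dimension: the bit is set when (vector - centroid) is positive. */
  static byte[] quantize(float[] vector, float[] centroid) {
    byte[] packed = new byte[(vector.length + 7) / 8];
    for (int i = 0; i < vector.length; i++) {
      if (vector[i] - centroid[i] > 0f) {
        // Bit (i % 8) of byte (i / 8), least significant bit first.
        packed[i >> 3] |= (byte) (1 << (i & 7));
      }
    }
    return packed;
  }

  /** Euclidean distance from the vector to its centroid. */
  static float distanceToCentroid(float[] vector, float[] centroid) {
    double sum = 0;
    for (int i = 0; i < vector.length; i++) {
      double d = vector[i] - centroid[i];
      sum += d * d;
    }
    return (float) Math.sqrt(sum);
  }
}
```

With 5x oversampling, a search for the top `k` hits would gather roughly `5 * k` candidates using the quantized scores and then rescore them against the raw vectors.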
   
   The vectors as stored in the index:
   
   | quantized vector | distance_to_centroid | vector magnitude |
   | - | - | - |
   | (vector_dimension/8) bytes | float | float |
   
   NOTE: One tricky part I am stuck on is keeping track of 
`vectorOrdinal -> centroidOrdinal`. Right now, I tack a `byte` onto the end of 
each vector in the file... this will likely throw off paging sizes. So, I am 
considering moving the `vectorOrdinal -> centroidOrdinal` mapping to a 
`LongValues` collection at the end of the file...
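
To make the paging concern concrete, the per-vector record size implied by the table above can be sketched as follows. The helper names are hypothetical, and rounding the bit count up to a whole byte is an assumption (the table only says `vector_dimension/8`).

```java
// Back-of-the-envelope record size for the layout described above.
// Assumption: names and the round-up rule are illustrative, not from the PR.
final class QuantizedVectorLayoutSketch {

  /** Bytes per stored vector: packed bits, two float corrections, optional centroid byte. */
  static int bytesPerVector(int dims, boolean inlineCentroidByte) {
    int quantizedBytes = (dims + 7) / 8;   // one bit per dimension
    int correctionBytes = 2 * Float.BYTES; // distance_to_centroid + vector magnitude
    return quantizedBytes + correctionBytes + (inlineCentroidByte ? 1 : 0);
  }

  /** File offset of a vector's record, assuming fixed-size records stored back to back. */
  static long recordOffset(int vectorOrdinal, int dims, boolean inlineCentroidByte) {
    return (long) vectorOrdinal * bytesPerVector(dims, inlineCentroidByte);
  }
}
```

For 768 dimensions this gives 104 bytes per record without the trailing centroid byte and 105 bytes with it, so the extra byte makes every record after the first start at an odd, unaligned offset.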
   
   The vector metadata, in addition to all the regular things (similarity, 
encoding, sparse vector DISI, etc.) keeps track of:
   
    - number of centroids
    - centroids
   
   For indexing into HNSW we actually have a multi-step process. RaBitQ encodes 
the query vectors differently from the index vectors. Consequently, during 
segment merge & HNSW building, another temporary file is written containing the 
query-quantized vectors over the configured centroids.
   
   When adding a vector to the graph, we read its query-quantized form from the 
temporary file; when exploring HNSW, we score it against the indexed quantized 
values.
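
The asymmetry between query and index encodings can be illustrated with a small Java sketch. It assumes the query side is scalar-quantized to 4-bit integers while the index side is 1 bit per dimension; the description above does not state the query bit width, so treat that as an assumption. The class and method names are hypothetical, and the correction terms needed to turn the raw integer dot product into a similarity estimate are omitted.

```java
// Asymmetric scoring sketch: 4-bit query values against 1-bit index vectors.
// Assumption: the 4-bit query width and all names here are illustrative only.
final class AsymmetricDotSketch {

  /** Scalar-quantizes each query component into 0..15 over the range [min, max]. */
  static byte[] quantizeQueryInt4(float[] query, float min, float max) {
    byte[] out = new byte[query.length];
    float range = max - min;
    if (range <= 0f) {
      return out; // degenerate range: everything maps to 0
    }
    for (int i = 0; i < query.length; i++) {
      int v = Math.round((query[i] - min) * 15f / range);
      out[i] = (byte) Math.max(0, Math.min(15, v)); // clamp out-of-range values
    }
    return out;
  }

  /** Integer dot product: sums the query values at positions whose index bit is set. */
  static int dot(byte[] packedIndexBits, byte[] int4Query) {
    int sum = 0;
    for (int i = 0; i < int4Query.length; i++) {
      int bit = (packedIndexBits[i >> 3] >> (i & 7)) & 1;
      if (bit != 0) {
        sum += int4Query[i];
      }
    }
    return sum;
  }
}
```

Under this reading, the temporary file written during merge would hold the query-side encoding of each vector, so graph construction can score a new vector against already-indexed 1-bit vectors the same way a search-time query would.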
   
   # What's left
   
    - [x] Initial skeleton design
    - [ ] indexing & scoring for dot-product, cosine & inner-product
    - [ ] indexing & scoring for euclidean
    - [ ] multi-centroid support during search
    - [ ] possibly switch to `LongValues` for storing vectorOrd -> centroidOrd 
mapping
    - [ ] format testing infra
    - [ ] Testing with Lucene Util over various datasets for efficacy
   
   closes: https://github.com/apache/lucene/issues/13650


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@lucene.apache.org
