benwtrent opened a new pull request, #13651: URL: https://github.com/apache/lucene/pull/13651
# Not only a draft, but a very rough one indeed

I am not opening this for the sake of review, just for openness and for those curious about the work.

# High-level design

RaBitQ is basically a better binary quantization, which works across all models we have tested against. Like PQ, it does require coarse-grained clustering to be effective at higher vector densities (effective being defined as only requiring 5x or lower oversampling for recall > 95%). But in our testing, the number of vectors required per cluster can be exceptionally large (10s to 100s of millions).

The vectors as stored in the index:

| quantized vector | distance_to_centroid | vector magnitude |
| - | - | - |
| (vector_dimension/8) bytes | float | float |

NOTE: One tricky part I am stuck on is keeping track of `vectorOrdinal -> centroidOrdinal`. Right now, I tack on a `byte` at the end of each vector in the file, which will likely throw off paging sizes. So I am considering moving the `vectorOrdinal -> centroidOrdinal` mapping into a `LongValues` collection at the end of the file.

The vector metadata, in addition to all the regular things (similarity, encoding, sparse vector DISI, etc.), keeps track of:

- number of centroids
- centroids

Indexing into HNSW is actually a multi-step process. RaBitQ encodes the query vectors differently from the index vectors. Consequently, during segment merge and HNSW building, another temporary file is written containing the query-quantized vectors over the configured centroids. We then read from the query temporary file when adding a vector to the graph, and when exploring HNSW, we search the indexed quantized values.
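To make the per-vector layout in the table above concrete, here is a minimal sketch of how one record could be assembled. It is not the code in this PR: the class and method names are hypothetical, and the bit order (most significant bit first), the little-endian byte order, rounding the bit count up to whole bytes, and taking "vector magnitude" to mean the L2 norm of the raw vector are all assumptions made for illustration. It also does plain sign-bit quantization of the centered vector and skips the rest of RaBitQ.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative only: hypothetical names, not this PR's classes or file format. */
final class QuantizedRecordSketch {

  /** Bytes per record: packed sign bits plus two floats (distance to centroid, magnitude). */
  static int recordBytes(int dims) {
    // Assumption: round up to whole bytes when dims is not a multiple of 8.
    return (dims + 7) / 8 + 2 * Float.BYTES;
  }

  /** Packs one bit per dimension: 1 when the centered component is positive. */
  static byte[] packSignBits(float[] vector, float[] centroid) {
    byte[] packed = new byte[(vector.length + 7) / 8];
    for (int i = 0; i < vector.length; i++) {
      if (vector[i] - centroid[i] > 0f) {
        // Assumption: most significant bit first within each byte.
        packed[i >> 3] |= (byte) (1 << (7 - (i & 7)));
      }
    }
    return packed;
  }

  /** Euclidean distance between two vectors of equal length. */
  static float distance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return (float) Math.sqrt(sum);
  }

  /** L2 norm of the raw (uncentered) vector. */
  static float magnitude(float[] v) {
    double sum = 0;
    for (float x : v) {
      sum += (double) x * x;
    }
    return (float) Math.sqrt(sum);
  }

  /** Lays out one record as: sign bits | distance_to_centroid | vector magnitude. */
  static byte[] encode(float[] vector, float[] centroid) {
    ByteBuffer buf =
        ByteBuffer.allocate(recordBytes(vector.length)).order(ByteOrder.LITTLE_ENDIAN);
    buf.put(packSignBits(vector, centroid));
    buf.putFloat(distance(vector, centroid));
    buf.putFloat(magnitude(vector));
    return buf.array();
  }

  public static void main(String[] args) {
    float[] centroid = new float[8]; // all zeros
    float[] vector = {3f, 4f, 0f, 0f, 0f, 0f, 0f, 0f};
    byte[] record = encode(vector, centroid);
    System.out.println("record bytes for 8 dims: " + record.length);
    System.out.println("record bytes for 768 dims: " + recordBytes(768));
  }
}
```

Under these assumptions every record has a fixed size of `dims/8 + 8` bytes. Appending a per-vector centroid `byte`, as described in the note above, would add one more byte to each record.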
# What's left

- [x] Initial skeleton design
- [ ] indexing & scoring for dot-product, cosine & inner-product
- [ ] indexing & scoring for euclidean
- [ ] multi-centroid support during search
- [ ] possibly switch to `LongValues` for storing vectorOrd -> centroidOrd mapping
- [ ] format testing infra
- [ ] Testing with Lucene Util over various datasets for efficacy

closes: https://github.com/apache/lucene/issues/13650

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org