benwtrent opened a new pull request, #13651:
URL: https://github.com/apache/lucene/pull/13651
# Not only a draft, but a very rough one indeed
I am not opening this for review, just for openness and for those curious
about the work.
# High-level design
RaBitQ is basically a better binary quantization, and it works across all the
models we have tested against. Like product quantization (PQ), it requires
coarse-grained clustering to be effective at higher vector densities
(effective meaning that 5x or lower oversampling is enough for recall > 95%).
In our testing, however, the number of vectors per cluster can be
exceptionally large (tens to hundreds of millions).
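To make the "5x oversampling for recall > 95%" criterion concrete, here is a minimal sketch of the two quantities involved. The class and method names are hypothetical and are not part of this PR; it only assumes the usual definition of recall@k against an exact top-k.

```java
import java.util.HashSet;
import java.util.Set;

public class OversamplingSketch {
  /** Number of candidates to gather from the quantized index before rescoring down to k. */
  public static int candidatesToCollect(int k, double oversample) {
    return (int) Math.ceil(k * oversample);
  }

  /** recall@k: fraction of the exact top-k ids that also appear in the returned top-k. */
  public static double recallAtK(int[] exactTopK, int[] returnedTopK) {
    Set<Integer> truth = new HashSet<>();
    for (int id : exactTopK) {
      truth.add(id);
    }
    int hits = 0;
    for (int id : returnedTopK) {
      // remove() so a duplicated id in the results is only counted once
      if (truth.remove(id)) {
        hits++;
      }
    }
    return (double) hits / exactTopK.length;
  }
}
```

With `k = 10` and 5x oversampling, 50 candidates would be gathered from the quantized index and then rescored down to 10.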
The vectors are stored in the index as:
| quantized vector | distance_to_centroid | vector magnitude |
| - | - | - |
| (vector_dimension/8) bytes | float | float |
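The layout in the table above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the class name, bit order, little-endian byte order, and rounding the bit count up to a whole byte are all my own choices, and the random rotation that RaBitQ applies before taking sign bits is omitted.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class QuantizedRecordSketch {
  /** Bytes per stored vector: one bit per dimension plus two float corrective terms. */
  public static int recordBytes(int dims) {
    return (dims + 7) / 8 + 2 * Float.BYTES;
  }

  /** Encodes one vector relative to its centroid (hypothetical layout). */
  public static byte[] encode(float[] vector, float[] centroid) {
    int dims = vector.length;
    byte[] bits = new byte[(dims + 7) / 8];
    double distSq = 0;
    double magSq = 0;
    for (int i = 0; i < dims; i++) {
      float residual = vector[i] - centroid[i];
      if (residual > 0) {
        // one sign bit per dimension, least significant bit first (assumption)
        bits[i >> 3] |= (byte) (1 << (i & 7));
      }
      distSq += residual * residual;
      magSq += vector[i] * vector[i];
    }
    ByteBuffer buf = ByteBuffer.allocate(recordBytes(dims)).order(ByteOrder.LITTLE_ENDIAN);
    buf.put(bits);
    buf.putFloat((float) Math.sqrt(distSq)); // distance_to_centroid
    buf.putFloat((float) Math.sqrt(magSq));  // vector magnitude
    return buf.array();
  }
}
```

Under these assumptions a 768-dimensional vector takes 96 + 8 = 104 bytes, compared with 3072 bytes for raw `float32`.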
NOTE: One tricky part I am stuck on is keeping track of `vectorOrdinal ->
centroidOrdinal`. Right now, I tack a `byte` onto the end of each vector in
the file, which will likely throw off paging sizes. So I am considering
moving the `vectorOrdinal -> centroidOrdinal` mapping into a `LongValues`
collection at the end of the file.
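One way to picture the separate mapping is a packed array that uses only as many bits per entry as the centroid count needs. The sketch below is plain Java and a stand-in for whatever `LongValues`-backed structure ends up being used; the class name and the packing scheme are hypothetical and do not reflect Lucene's packed-ints implementation.

```java
public class CentroidOrdinalMapSketch {
  private final long[] blocks;
  private final int bitsPerValue;
  private final long mask;

  public CentroidOrdinalMapSketch(int numVectors, int numCentroids) {
    // bits needed to represent ordinals in [0, numCentroids), at least 1
    this.bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(numCentroids - 1));
    this.mask = (1L << bitsPerValue) - 1;
    this.blocks = new long[(int) (((long) numVectors * bitsPerValue + 63) / 64)];
  }

  public void set(int vectorOrd, int centroidOrd) {
    long bitPos = (long) vectorOrd * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long value = centroidOrd & mask;
    blocks[block] = (blocks[block] & ~(mask << shift)) | (value << shift);
    int spill = shift + bitsPerValue - 64; // bits that do not fit in this block
    if (spill > 0) {
      int kept = bitsPerValue - spill;
      blocks[block + 1] = (blocks[block + 1] & ~(mask >>> kept)) | (value >>> kept);
    }
  }

  public int get(int vectorOrd) {
    long bitPos = (long) vectorOrd * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long value = blocks[block] >>> shift;
    int spill = shift + bitsPerValue - 64;
    if (spill > 0) {
      // high bits of the value live in the low bits of the next block
      value |= blocks[block + 1] << (bitsPerValue - spill);
    }
    return (int) (value & mask);
  }
}
```

Keeping the mapping out of the vector records leaves each record at a fixed `vector_dimension/8 + 8` bytes, and with, say, 300 centroids each mapping entry costs 9 bits.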
The vector metadata, in addition to all the regular things (similarity,
encoding, sparse vector DISI, etc.) keeps track of:
- number of centroids
- centroids
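Since the centroids are kept in the metadata, each vector has to be assigned to one of them before it is quantized. A minimal sketch of that assignment, assuming squared Euclidean distance picks the centroid (the PR text does not say which distance is used) and using a hypothetical helper name:

```java
public class NearestCentroidSketch {
  /** Returns the ordinal of the centroid closest to the vector by squared Euclidean distance. */
  public static int nearestCentroid(float[] vector, float[][] centroids) {
    int best = -1;
    float bestDist = Float.POSITIVE_INFINITY;
    for (int c = 0; c < centroids.length; c++) {
      float dist = 0;
      for (int i = 0; i < vector.length; i++) {
        float diff = vector[i] - centroids[c][i];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }
}
```

The returned ordinal is what the `vectorOrdinal -> centroidOrdinal` mapping would record.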
Indexing into HNSW is actually a multi-step process. RaBitQ encodes query
vectors differently from index vectors. Consequently, during segment merge and
HNSW building, another temporary file is written containing the
query-quantized vectors over the configured centroids.
When adding a vector to the graph, we read its query-quantized form from that
temporary file; when exploring HNSW, we search the indexed quantized values.
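To illustrate why the two encodings differ, here is a rough sketch of asymmetric scoring. It assumes the query side is scalar-quantized to 4 bits per dimension, as described in the RaBitQ paper; the description above does not spell out the query encoding, so treat that as an assumption. All names are hypothetical, and the corrective terms needed to turn this raw sum into a similarity estimate are left out.

```java
public class AsymmetricDotSketch {
  /** Index side: one bit per dimension (1 if the component is positive), packed into longs. */
  public static long[] packIndexBits(float[] v) {
    long[] bits = new long[(v.length + 63) / 64];
    for (int i = 0; i < v.length; i++) {
      if (v[i] > 0) {
        bits[i >> 6] |= 1L << (i & 63);
      }
    }
    return bits;
  }

  /** Query side: 4-bit values (0..15) split into four bit planes, one per bit position. */
  public static long[][] packQueryPlanes(int[] q4) {
    long[][] planes = new long[4][(q4.length + 63) / 64];
    for (int i = 0; i < q4.length; i++) {
      for (int b = 0; b < 4; b++) {
        if (((q4[i] >> b) & 1) != 0) {
          planes[b][i >> 6] |= 1L << (i & 63);
        }
      }
    }
    return planes;
  }

  /** Computes sum_i indexBit_i * q4_i using AND plus popcount on each bit plane. */
  public static long dot(long[] indexBits, long[][] queryPlanes) {
    long sum = 0;
    for (int b = 0; b < 4; b++) {
      long planeCount = 0;
      for (int w = 0; w < indexBits.length; w++) {
        planeCount += Long.bitCount(indexBits[w] & queryPlanes[b][w]);
      }
      sum += planeCount << b; // plane b carries weight 2^b
    }
    return sum;
  }
}
```

Under this assumption, the temporary file would hold the bit-plane form of each vector so that graph construction can score a "query" vector against already-indexed 1-bit vectors in the same way a search would.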
# What's left
- [x] Initial skeleton design
- [ ] indexing & scoring for dot-product, cosine & inner-product
- [ ] indexing & scoring for euclidean
- [ ] multi-centroid support during search
- [ ] possibly switch to `LongValues` for storing vectorOrd -> centroidOrd
mapping
- [ ] format testing infra
- [ ] Testing with Lucene Util over various datasets for efficacy
closes: https://github.com/apache/lucene/issues/13650