benwtrent opened a new pull request, #14078: URL: https://github.com/apache/lucene/pull/14078
This adds a binary quantization format for vectors. The key ideas are:

- Centroid-centered vectors
- Asymmetric quantization
- Individually optimized scalar quantization

This allows Lucene to have a single scalar quantization format that supports high-quality vector retrieval, even down to a single bit.

For all similarity types, each vector is laid out on disk as follows:

| quantized vector | lower quantile | upper quantile | additional correction | sum of quantized components |
| - | - | - | - | - |
| (vector_dimension/8) bytes | float | float | float | short |

During segment merges and HNSW graph building, a temporary file is also written; it contains the query-quantized vectors over the configured centroids. One downside is that this temporary file is actually larger than the regular vector index, because asymmetric quantization retains more information in the query-side vectors. Once the merge is complete, this file is deleted. I think this file can eventually be removed.

Here are the results for _Recall@10|50_:

| Dataset | Old PR | This PR | Improvement (absolute) |
| --- | --- | --- | --- |
| Cohere 768 | 0.933 | 0.938 | 0.5% |
| Cohere 1024 | 0.932 | 0.945 | 1.3% |
| E5-Small-v2 | 0.972 | 0.975 | 0.3% |
| GIST-1M | 0.740 | 0.989 | 24.9% |

Even with the optimization step, indexing time with HNSW increases only marginally.

| Dataset | Old PR | This PR | Difference |
| --- | --- | --- | --- |
| Cohere 768 | 368.62s | 372.95s | +1% |
| Cohere 1024 | 307.09s | 314.08s | +2% |
| E5-Small-v2 | 227.37s | 229.83s | +1% |

The consistent improvement in recall, along with the flexibility to use different bit widths, makes this format and quantization technique strongly preferable.

Eventually, we should consider moving scalar quantization over to this new optimized quantizer. However, that would change the on-disk format and scoring, so I did not do it in this PR.

Supersedes: https://github.com/apache/lucene/pull/13651

Co-Authors: @tveasey @john-wagster @mayya-sharipova
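To make the per-vector layout in the table above concrete, here is a minimal Python sketch of a single-bit record. It is not the PR's implementation. The function names (`quantize_one_bit`, `pack_bits`, `encode_record`, `record_size`) are hypothetical, and the sketch assumes a plain min/max interval in place of the individually optimized one, the squared norm of the centered vector as the "additional correction", most-significant-bit-first packing, and little-endian byte order.

```python
import struct


def quantize_one_bit(vector, centroid):
    """Center the vector on the centroid, then snap each component to 0 or 1."""
    centered = [v - c for v, c in zip(vector, centroid)]
    # Assumption: plain min/max stands in for the PR's optimized interval.
    lower, upper = min(centered), max(centered)
    midpoint = (lower + upper) / 2.0
    bits = [1 if x > midpoint else 0 for x in centered]
    # Assumption: squared norm of the centered vector as the correction term.
    correction = sum(x * x for x in centered)
    return bits, lower, upper, correction


def pack_bits(bits):
    """Pack eight one-bit components per byte, most significant bit first."""
    if len(bits) % 8 != 0:
        raise ValueError("dimension must be a multiple of 8 in this sketch")
    packed = bytearray(len(bits) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


def encode_record(vector, centroid):
    """Build one record: packed bits, then three floats and one short."""
    bits, lower, upper, correction = quantize_one_bit(vector, centroid)
    # "<fffh": little-endian, three 32-bit floats, one 16-bit signed short.
    trailer = struct.pack("<fffh", lower, upper, correction, sum(bits))
    return pack_bits(bits) + trailer


def record_size(dimension):
    """Bytes per vector: dimension/8 for the bits plus the 14-byte trailer."""
    return dimension // 8 + struct.calcsize("<fffh")
```

Under these assumptions, `record_size(768)` comes to 110 bytes per vector (96 bytes of packed bits plus 14 bytes of corrective terms), compared with 3072 bytes for the same vector stored as 32-bit floats.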
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org