huynmg opened a new issue, #14984: URL: https://github.com/apache/lucene/issues/14984
### Description

Inspired by the asymmetric quantization technique used in [BBQ](https://www.elastic.co/search-labs/blog/better-binary-quantization-lucene-elasticsearch#asymmetric-quantization,-the-interesting-bits), I'm exploring adopting the same idea for scalar quantization. The core idea is to quantize the query with more bits than were used to quantize the document vectors at indexing time (see the scoring sketch at the end of this issue). This reduces information loss during quantization, leading to a better-approximated distance and ultimately higher recall. Importantly, we can achieve this without increasing the hot RAM needed to load document vectors.

I made a quick prototype to test the idea and benchmarked it with the following settings:

- Document vectors quantized to 4 bits
- Query vectors quantized to 4, 7, and 15 bits
- Dot product as the similarity score
- Same parameters as the nightly benchmark on the Cohere 768-dimension dataset

#### 4-bit query + 4-bit index
```
recall  latency(ms)  netCPU  avgCpuCount  nDoc     topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.468   2.926        24.651  8.425        8000000  100   50      32       100        4 bits     2682.55   2982.24       17            29879.53        29327.393     5889.893     HNSW
```

#### 7-bit query + 4-bit index
```
recall  latency(ms)  netCPU  avgCpuCount  nDoc     topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.572   2.966        24.714  8.332        8000000  100   50      32       100        4 bits     0.00      Infinity      17            29879.53        29327.393     5889.893     HNSW
```

#### 15-bit query + 4-bit index
```
recall  latency(ms)  netCPU  avgCpuCount  nDoc     topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.574   3.098        26.096  8.423        8000000  100   50      32       100        4 bits     0.00      Infinity      17            29879.53        29327.393     5889.893     HNSW
```

We can see that quantizing the query to 7 bits and using it to compute the dot product yields roughly 10 points higher recall than 4-bit query quantization (0.572 vs. 0.468), at almost the same latency. Increasing the query quantization to 15 bits, i.e., representing each dimension by an `int` instead of a `byte`, provides only a marginal additional recall gain over 7 bits (0.574 vs. 0.572).

This asymmetric quantization technique can also be leveraged for reranking without incurring additional hot-RAM cost. A typical reranking setup requires loading additional, higher-precision vectors (e.g., float vectors) into RAM to compute the reranking score. By scoring a higher-bit quantized query against the lower-bit indexed vectors, we can compute a reranking score without that memory overhead (see the rerank sketch at the end of this issue).

The recall/latency tradeoff looks promising, so I wanted to share this quick result and discuss whether the idea is worth pursuing.
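For concreteness, here is a minimal sketch of what the asymmetric scoring kernel could look like. It assumes document dimensions are packed two 4-bit values per byte in adjacent pairs (the prototype's and Lucene's actual half-byte layouts may differ) and a 7-bit query with one value per byte; the method name and layout are illustrative, not taken from the prototype.

```java
// Minimal sketch, not the prototype's code: score a 4-bit packed document
// vector against a 7-bit quantized query. Assumes adjacent-pair packing,
// two 4-bit dimensions per byte. The raw integer score would still need the
// scalar quantizer's offset/scale corrections applied afterwards to
// approximate the true float dot product.
static int asymmetricDotProduct(byte[] packedDoc, byte[] query7bit) {
  int score = 0;
  for (int i = 0; i < packedDoc.length; i++) {
    int lo = packedDoc[i] & 0x0F;          // first 4-bit dimension in this byte
    int hi = (packedDoc[i] >>> 4) & 0x0F;  // second 4-bit dimension in this byte
    // 7-bit query values lie in [0, 127], so the signed bytes read non-negative.
    score += lo * query7bit[2 * i];
    score += hi * query7bit[2 * i + 1];
  }
  return score;
}
```

In practice this inner loop would be vectorized like Lucene's existing scorers, but the scalar form makes the point visible: only the query side gets wider, while the document bytes (and hence hot RAM) stay exactly as they are for symmetric 4-bit scoring.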
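And a hypothetical rescoring pass built on the same kernel, reusing the packed document bytes already resident for first-phase search (all names here are placeholders, not Lucene APIs):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical reranking pass: rescore first-phase candidates with the
// higher-bit query against the same 4-bit packed vectors, so no extra
// per-document data (e.g., float vectors) has to be loaded into RAM.
static int[] rerank(int[] candidates, byte[][] packedDocs, byte[] query7bit, int topK) {
  // Min-heap on score keeps the topK best-scoring candidates seen so far.
  PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt((int[] a) -> a[0]));
  for (int doc : candidates) {
    int score = asymmetricDotProduct(packedDocs[doc], query7bit);
    heap.offer(new int[] {score, doc});
    if (heap.size() > topK) {
      heap.poll(); // evict the current worst candidate
    }
  }
  int[] reranked = new int[heap.size()];
  for (int i = reranked.length - 1; i >= 0; i--) {
    reranked[i] = heap.poll()[1]; // heap yields worst first, so fill from the back
  }
  return reranked; // doc ids, best score first
}
```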