huynmg opened a new issue, #14984:
URL: https://github.com/apache/lucene/issues/14984

   ### Description
   
   Inspired by the asymmetric quantization technique used in [BBQ](https://www.elastic.co/search-labs/blog/better-binary-quantization-lucene-elasticsearch#asymmetric-quantization,-the-interesting-bits), I'm exploring adopting the same idea for scalar quantization. The core idea is to quantize the query with more bits than the number of bits used to quantize document vectors at indexing time. This reduces information loss during quantization, leading to a better approximated distance and, ultimately, higher recall. Importantly, we can achieve this without increasing the hot RAM needed to load document vectors.
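
   To make the idea concrete, here is a minimal sketch of the asymmetric scoring path. The class and method names are illustrative, not Lucene's actual API; it assumes an unpacked layout where each 4-bit document value occupies one byte, and the raw integer sum would still need the usual scalar-quantization correction terms to approximate the true dot product:

   ```java
   final class AsymmetricScalarQuantizer {

     /** Quantize a float vector to levels in [0, (1 << bits) - 1] over [minQuantile, maxQuantile]. */
     static int[] quantize(float[] vector, int bits, float minQuantile, float maxQuantile) {
       int maxLevel = (1 << bits) - 1;
       float scale = maxLevel / (maxQuantile - minQuantile);
       int[] quantized = new int[vector.length];
       for (int i = 0; i < vector.length; i++) {
         float clamped = Math.max(minQuantile, Math.min(maxQuantile, vector[i]));
         quantized[i] = Math.round((clamped - minQuantile) * scale);
       }
       return quantized;
     }

     /**
      * Asymmetric integer dot product: the query is quantized with more bits (e.g. 7)
      * than the stored document values (4 bits), so no extra document bytes are needed.
      */
     static long dotProduct(int[] query, byte[] doc4bit) {
       long sum = 0;
       for (int i = 0; i < doc4bit.length; i++) {
         sum += (long) query[i] * (doc4bit[i] & 0x0F); // low nibble holds the 4-bit value
       }
       return sum;
     }
   }
   ```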
   
   I made a quick prototype to test the idea and benchmarked it with the following settings:
   - Document vectors quantized to 4 bits
   - Query vectors quantized to 4 bits, 7 bits, and 15 bits (see the usage sketch after this list)
   - Dot product as the similarity score
   - Same parameters as the nightly benchmark, on the Cohere 768-dimension dataset.
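
   For reference, the three query variants can be produced from the same quantiles learned for the 4-bit index, e.g. with the hypothetical quantizer sketched above (the quantile values here are made up):

   ```java
   float minQuantile = -0.5f, maxQuantile = 0.5f; // illustrative quantiles
   float[] query = new float[768];                // e.g. a Cohere 768-dim query embedding
   int[] q4  = AsymmetricScalarQuantizer.quantize(query, 4,  minQuantile, maxQuantile);
   int[] q7  = AsymmetricScalarQuantizer.quantize(query, 7,  minQuantile, maxQuantile);
   int[] q15 = AsymmetricScalarQuantizer.quantize(query, 15, minQuantile, maxQuantile);
   ```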
   
   #### 4 bit query + 4 bit index
   
   ```
   recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
    0.468        2.926  24.651        8.425  8000000   100      50       32        100     4 bits   2682.55       2982.24            17        29879.53     29327.393     5889.893       HNSW
   ```
   #### 7 bit query + 4 bit index
   
   ```
   recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
    0.572        2.966  24.714        8.332  8000000   100      50       32        100     4 bits      0.00      Infinity            17        29879.53     29327.393     5889.893       HNSW
   ```
   
   #### 15 bit query + 4 bit index 
   ```
   recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
    0.574        3.098  26.096        8.423  8000000   100      50       32        100     4 bits      0.00      Infinity            17        29879.53     29327.393     5889.893       HNSW
   ```
   
   
   We can see that quantizing the query to 7 bits and using it to compute the dot product score yields roughly 10 points higher recall (0.468 → 0.572) compared to 4-bit query quantization, with almost the same latency. Increasing query quantization to 15 bits, i.e. representing each dimension by an `int` instead of a `byte`, provides only a marginal additional recall gain over 7-bit quantization.
   
   
   This asymmetric quantization technique can also be leveraged for reranking without incurring additional hot RAM costs. A typical reranking setup requires loading additional, higher-precision vectors (e.g., float vectors) into RAM to compute a reranking score. By scoring a higher-bit quantized query against the lower-bit indexed vectors, we can compute the reranking score without that memory overhead.
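
   A hedged sketch of that reranking pass, reusing the hypothetical quantizer above (correction terms are again omitted, and a higher raw dot product is assumed to mean a better match):

   ```java
   import java.util.Arrays;
   import java.util.Comparator;

   final class AsymmetricReranker {

     record Candidate(int docId, long rawScore) {}

     /**
      * Rescore the approximate top-N with a higher-precision (e.g. 15-bit) query against
      * the same 4-bit document vectors already resident for graph search, instead of
      * fetching separate float vectors from disk.
      */
     static Candidate[] rerank(int[] topDocIds, int[] query15bit, byte[][] docVectors4bit) {
       Candidate[] rescored = new Candidate[topDocIds.length];
       for (int i = 0; i < topDocIds.length; i++) {
         int docId = topDocIds[i];
         long raw = AsymmetricScalarQuantizer.dotProduct(query15bit, docVectors4bit[docId]);
         rescored[i] = new Candidate(docId, raw);
       }
       Arrays.sort(rescored, Comparator.comparingLong(Candidate::rawScore).reversed());
       return rescored;
     }
   }
   ```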
   
   The recall/latency tradeoff looks promising, so I wanted to share these quick results and discuss whether the idea is worth pursuing.
    
   

