benwtrent opened a new issue, #13650:
URL: https://github.com/apache/lucene/issues/13650

   ### Description
   
   There have been multiple discussions around supporting higher levels of vector quantization (i.e., more compression than scalar quantization can provide).
   
   We (Elastic) have been researching what the best next step would be for higher levels of quantization in Lucene.
   
   A couple of months ago we came across an interesting paper, https://arxiv.org/abs/2405.12497, and have been actively experimenting with RaBitQ.
   
   Its approach to quantization fits well with Lucene:
   
    - Higher levels of compression (32x relative to 32-bit floats; see the sketch after this list)
    - Very cheap quantization costs
    - Fast vector similarities
    - Lower levels of reranking required
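
   To make the compression and similarity-cost points above concrete, here is a minimal sketch of 1-bit-per-dimension coding. This is not the actual RaBitQ scheme (which also applies a random rotation and keeps per-vector correction factors to debias the estimate); it only shows where the 32x ratio comes from and why the distance work collapses to XOR + popcount. The class and method names are made up for illustration.

   ```java
   final class OneBitCode {

     /** Quantize a float vector to 1 bit per dimension, packed into longs. */
     static long[] quantize(float[] vector, float[] centroid) {
       long[] bits = new long[(vector.length + 63) / 64];
       for (int i = 0; i < vector.length; i++) {
         if (vector[i] > centroid[i]) {          // keep only the sign of the residual
           bits[i >> 6] |= 1L << (i & 63);
         }
       }
       return bits;                              // d bits instead of 32 * d bits
     }

     /** Cheap dissimilarity estimate: Hamming distance over the packed codes. */
     static int hamming(long[] a, long[] b) {
       int distance = 0;
       for (int i = 0; i < a.length; i++) {
         distance += Long.bitCount(a[i] ^ b[i]); // a straight line of XOR + popcount
       }
       return distance;
     }
   }
   ```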
   
   In our investigations we have found:
   
    - RaBitQ achieves the same or better recall at the same compression ratios 
as Product Quantization.
    - RaBitQ is 20-30x faster than PQ at quantizing
    - RaBitQ (in Java) is more than 2x faster to query than PQ, and it can likely be made faster still, whereas PQ realistically cannot (see the sketch after this list)
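
   To make that query-cost difference concrete, here is a rough sketch of the per-document inner loop PQ has to run (asymmetric distance computation over per-subspace lookup tables), in contrast to the packed-bit arithmetic sketched above. The data-dependent table lookups are much harder to vectorize than XOR + popcount, which is a large part of why there is little headroom left for PQ. Names and layout are illustrative, not our benchmark code.

   ```java
   final class PqDistance {

     /**
      * distanceTables[m][c] holds the precomputed distance between the query's m-th
      * sub-vector and centroid c of the m-th codebook; codes[m] is the document's
      * code for that subspace.
      */
     static float distance(float[][] distanceTables, byte[] codes) {
       float sum = 0f;
       for (int m = 0; m < codes.length; m++) {
         sum += distanceTables[m][codes[m] & 0xFF]; // one table lookup per subspace
       }
       return sum;
     }
   }
   ```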
   
   Let me share some PoC numbers. All our testing with PQ assumed a 32x reduction. Higher levels of reduction require more reranking, which then becomes the main bottleneck. Our goal was to achieve good recall with only 5x reranking and without Optimized PQ (which is insanely expensive; plain PQ is already very expensive).
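
   Concretely, the pattern is oversample-and-rerank: gather `k * 5` candidates with the cheap quantized scores, then re-score only that small set with the original float vectors. This lines up with the `Recall@10:50` numbers below (top 10 judged after gathering 50 candidates). The interfaces and names in this sketch are illustrative placeholders, not Lucene or RaBitQ APIs.

   ```java
   import java.util.Comparator;
   import java.util.List;
   import java.util.stream.Collectors;

   final class OversampleRerank {

     /** Cheap, approximate scorer over the quantized codes (illustrative interface). */
     interface CoarseScorer {
       List<Integer> topDocs(float[] query, int n);
     }

     /** Access to the original float vectors, kept around only for reranking. */
     interface FloatVectors {
       float[] get(int doc);
     }

     static List<Integer> search(float[] query, int k, int oversample,
                                 CoarseScorer coarse, FloatVectors exact) {
       // 1. Approximate pass: gather oversample * k candidates (5x in the numbers below).
       List<Integer> candidates = coarse.topDocs(query, k * oversample);
       // 2. Exact pass: rerank only that small candidate set with full-precision vectors.
       return candidates.stream()
           .sorted(Comparator.comparingDouble(
               (Integer doc) -> -dotProduct(query, exact.get(doc))))
           .limit(k)
           .collect(Collectors.toList());
     }

     static float dotProduct(float[] a, float[] b) {
       float sum = 0f;
       for (int i = 0; i < a.length; i++) {
         sum += a[i] * b[i];
       }
       return sum;
     }
   }
   ```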
   
   30M cohereV3 vectors (on a GCP Intel machine with AVX-512):
   
   - RaBitQ:
     - Time to quantize vectors: `998363ms`
     - Avg Brute-force query time for Recall@10:50 `1776ms` with 98% recall
     - Time to build HNSW graph: `40043229ms`
        - HNSW recall@10:50: 86.2% at 1641qps
   - PQ:
     - Time to calculate codebooks: `472260ms`
     - Time to quantize vectors: `12644293ms`
     - Avg brute-force query time for Recall@10:50 `5790ms` with 98% recall.
     - We didn't bother building the large graph here because, frankly, PQ is just too slow.
   
   522k Quora vectors encoded with e5Small (noted for being horrible for binary quantization & higher compression ratios in general) (on my MacBook, ARM M1 Max):
   
   - RaBitQ:
     - 1051ms to quantize vectors
     - Avg Brute-force query time for Recall@10:50 `11ms` with 99% recall
   - PQ:
     - 28360ms to build the codebooks
     - 31364ms to encode the docs
     - Avg Brute-force query time for Recall@10:50 `19ms` with 99% recall
   
   Consequently, given all the various trade-offs and benefits, we are working on bringing a RaBitQ format into Lucene.
   
   I will open a draft PR soon to make the work more visible.

