dungba88 commented on PR #14009:
URL: https://github.com/apache/lucene/pull/14009#issuecomment-2499786440
   I have a preliminary benchmark here (top-k=100, fanout=0) using the Cohere 768-dimension dataset.
   
   
![image](https://github.com/user-attachments/assets/d40fdc53-019b-4515-bff4-a29162d9b9da)
   
   Anyhow, I see two things that should be addressed:
   - If we access the full-precision vectors, the memory allocated (either through preloading or through mmap) for the quantized vectors used in the main search phase gets swapped out when there isn't enough RAM. Eventually, some fraction of the quantized index will be paged out, which slows down the search. And if we have to keep all full-precision vectors in memory, that largely defeats the purpose of quantization. I'm wondering whether there is a way to access the full-precision vectors without competing for the memory that holds the quantized vectors.
   - The latency could be better. With oversample=1.5 (second dot) for 4_bit, we get roughly the same latency and recall as the baseline. One can argue that we still save memory compared to the baseline, but with the new access pattern of two-phase search (sketched below) that saving may be diminished; otherwise this seems to offer little benefit over plain HNSW.
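   For rough scale on the first point: a 768-dimension float32 vector is 3,072 bytes, while a 4-bit quantized one is roughly 384 bytes plus correction terms, so paging the raw vectors in for reranking can evict a meaningful fraction of the quantized data when RAM is tight. Below is a minimal sketch of the generic two-phase pattern being discussed, not the PR's implementation: an oversampled approximate search over the quantized index, then rescoring the candidates against full-precision vectors. It assumes the Lucene 9.x-style `FloatVectorValues` iterator API (the API on main differs), DOT_PRODUCT similarity, and treats `oversample` and the field name as placeholders.

```java
// Sketch only, not the PR's code. Assumes Lucene 9.x FloatVectorValues iterator API
// and that every candidate doc has a float vector in the given field.
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TwoPhaseKnnSketch {

  static ScoreDoc[] search(IndexSearcher searcher, String field, float[] query,
                           int k, double oversample) throws IOException {
    // Phase 1: approximate search over the quantized HNSW index,
    // collecting k * oversample candidates instead of just k.
    int candidates = (int) Math.ceil(k * oversample);
    TopDocs approx = searcher.search(new KnnFloatVectorQuery(field, query, candidates), candidates);

    // Phase 2: rescore the candidates with full-precision vectors.
    // This is the access pattern that touches the raw vector data and
    // competes for page cache with the quantized vectors used in phase 1.
    List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();
    ScoreDoc[] rescored = new ScoreDoc[approx.scoreDocs.length];
    for (int i = 0; i < approx.scoreDocs.length; i++) {
      ScoreDoc sd = approx.scoreDocs[i];
      LeafReaderContext ctx = leaves.get(ReaderUtil.subIndex(sd.doc, leaves));
      FloatVectorValues vectors = ctx.reader().getFloatVectorValues(field);
      vectors.advance(sd.doc - ctx.docBase);   // position on the candidate doc
      float[] full = vectors.vectorValue();     // full-precision vector
      float score = VectorSimilarityFunction.DOT_PRODUCT.compare(query, full);
      rescored[i] = new ScoreDoc(sd.doc, score);
    }

    // Keep the best k after exact rescoring.
    Arrays.sort(rescored, Comparator.comparingDouble((ScoreDoc d) -> d.score).reversed());
    return Arrays.copyOf(rescored, Math.min(k, rescored.length));
  }
}
```

   The phase-2 loop is exactly where the memory-interference concern above comes in: ideally it would read the full-precision vectors without displacing the pages backing the quantized index.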

