dungba88 commented on PR #14009: URL: https://github.com/apache/lucene/pull/14009#issuecomment-2499786440
I have a preliminary benchmark here (top-k=100, fanout=0) using the Cohere 768 dataset. From it, I see two things that should be addressed:

- If we access the full-sized vectors, that access will evict the memory allocated (either through preloading or through mmap) for the quantized vectors (used in the main search phase) when there is not enough memory. Eventually, some portion of the quantized index will be swapped out, which will slow down the search. If we have to load all full-precision vectors into memory, that rather defeats the purpose of quantization. I'm wondering whether there is a way to access the full-precision vectors without interfering with the memory used by the quantized vectors.
- The latency could be better. With oversample=1.5 (second dot) for 4_bit, we get roughly the same latency and recall as the baseline. One could argue that we still save memory compared to the baseline, but with the new access pattern of two-phase search that saving may be diminished. Otherwise it seems to offer little benefit over plain HNSW.
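For readers following along, here is a minimal sketch of the two-phase search pattern being discussed: phase 1 selects `ceil(oversample * k)` candidates by cheap quantized scores, and phase 2 reranks only those candidates with full-precision scores. This is a standalone illustration, not Lucene's actual API; the `topK` helper and the score arrays are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class TwoPhaseRerank {
    // Hypothetical helper, not Lucene's API. approx[i] is the quantized (cheap)
    // score of doc i; exact[i] is its full-precision score. Higher is better.
    static int[] topK(float[] approx, float[] exact, int k, float oversample) {
        int phase1 = Math.min(approx.length, (int) Math.ceil(k * oversample));
        // Phase 1: candidate selection using the quantized scores only.
        Integer[] docs = IntStream.range(0, approx.length).boxed().toArray(Integer[]::new);
        Arrays.sort(docs, (a, b) -> Float.compare(approx[b], approx[a]));
        Integer[] candidates = Arrays.copyOf(docs, phase1);
        // Phase 2: rerank only the oversampled candidates with exact scores.
        Arrays.sort(candidates, (a, b) -> Float.compare(exact[b], exact[a]));
        return Arrays.stream(candidates).limit(k).mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        float[] approx = {0.9f, 0.8f, 0.1f, 0.7f};
        float[] exact  = {0.5f, 0.9f, 0.95f, 0.6f};
        // With k=2 and oversample=1.5, only 3 candidates survive phase 1, so
        // doc 2 (best exact score, worst approx score) is never reranked —
        // illustrating how a small oversample factor can cap recall.
        System.out.println(Arrays.toString(topK(approx, exact, 2, 1.5f)));
        // prints [1, 3]
    }
}
```

Note that phase 2 touches `oversample * k` full-precision vectors per query, which is exactly the access pattern the comment above worries about: those reads compete for page cache with the quantized index.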