benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808340050

Thank you @kevindrosendahl, this does seem to confirm my suspicion that the improvement isn't necessarily due to the data structure, but due to quantization. But this does confuse me, as Vamana is supposedly good when things don't all fit in memory. It may be due to how we fetch things (MMAP). I wonder if Vamana is any good at all when using MMAP...

For your testing infra, int8 quantization might close the gap at that scale. FYI, a significant (and needed) refactor occurred recently for int8 quantization & HNSW, so your testing branch might be significantly impacted :(.

> recall actually improves when introducing pq, and only starts to decrease at a factor of 16

I am surprised it decreases as the number of sub-spaces increases. This makes me think that JVector's PQ implementation is weird. Or is `pq=` not the number of subspaces, but `vectorDim/pq == numberOfSubSpaces`? If so, then recall should reduce as it increases, since each vector is encoded with fewer codes.

Regardless, is there any oversampling that is occurring when PQ is enabled in JVector?

> It's great that PQ goes quite a ways before hurting recall.

PQ is a sharp tool; at scale it can have significant drawbacks (eventually you have to start dramatically oversampling as centroids get very noisy). Though, I am not sure there is a significant recall cliff.

Two significant issues with a Lucene implementation I can think of are:

 - Segment merge time: We can maybe do some fancy things with better starter centroids in Lloyd's algorithm, but I think we will have to rerun Lloyd's algorithm on every merge. Additionally, the graph building probably cannot be done with the PQ'd vectors.
 - Graph quality: I don't think we can build the graph with PQ'd vectors and retain good recall. Meaning at merge time, we have to page in the larger raw (or differently quantized) vectors and only do PQ after graph creation.
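
To make the `pq=` question above concrete, here is a rough sketch of the arithmetic under the second reading, where `pq` is the number of dimensions per subspace and `vectorDim/pq == numberOfSubSpaces`. This is only an assumption about what the parameter means, not a statement about JVector's actual behavior. The function name `pq_layout`, the 1536-dimension example, the 8-bit codes, and the float32 raw vectors are all hypothetical choices for illustration.

```python
def pq_layout(vector_dim, pq_factor, bits_per_code=8, raw_bytes_per_dim=4):
    """Return (num_subspaces, code_bytes, compression_ratio).

    Assumes pq_factor means "dimensions per subspace" (a hypothetical
    reading of `pq=`), one code of `bits_per_code` bits per subspace,
    and raw vectors stored as float32.
    """
    if vector_dim % pq_factor != 0:
        raise ValueError("vector_dim must be divisible by pq_factor")
    num_subspaces = vector_dim // pq_factor
    # One code per subspace; 8-bit codes means one byte per subspace.
    code_bytes = num_subspaces * bits_per_code // 8
    raw_bytes = vector_dim * raw_bytes_per_dim
    return num_subspaces, code_bytes, raw_bytes / code_bytes


if __name__ == "__main__":
    # Hypothetical 1536-dimensional vectors, not taken from the benchmark.
    for factor in (2, 4, 8, 16):
        subspaces, code_bytes, ratio = pq_layout(1536, factor)
        print(f"pq={factor}: {subspaces} subspaces, "
              f"{code_bytes} bytes/vector, {ratio:.0f}x smaller than float32")
```

Under this reading, a larger `pq` means fewer subspaces and fewer bytes per vector, so recall dropping as `pq` grows would be the expected direction rather than a surprise.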

There are [interesting approaches to PQ w/ graph exploration](https://medium.com/@masajiro.iwasaki/fusion-of-graph-based-indexing-and-product-quantization-for-ann-search-7d1f0336d0d0), and a different PQ variant from Microsoft that is worth a look: [OPQ](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/pami13opq.pdf).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org