Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

via GitHub Mon, 13 Nov 2023 07:06:58 -0800


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808340050

Thank you @kevindrosendahl. This does seem to confirm my suspicion that the
improvement isn't necessarily due to the data structure, but due to
quantization. But this does confuse me, as Vamana is supposedly good when
things don't all fit in memory. It may be due to how we fetch things (MMAP). I
wonder if Vamana is any good at all when using MMAP...

For your testing infra, int8 quantization might close the gap at that scale.
FYI, a significant (and needed) refactor occurred recently for int8
quantization & HNSW, so your testing branch might be significantly impacted :(.

> recall actually improves when introducing pq, and only starts to decrease
> at a factor of 16

I am surprised it decreases as the number of sub-spaces increases. This
makes me think that JVector's PQ implementation is weird.

Or is `pq=` not the number of subspaces, but `vectorDim/pq ==
numberOfSubSpaces`? If so, then recall should reduce as it increases.
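
To make the second reading concrete, here is a minimal sketch of the
arithmetic, assuming `pq=` is the number of dimensions folded into each
subspace and that each subspace is encoded with an 8-bit code. The function
name `pq_layout` and the 768-dimension example are illustrative assumptions,
not JVector's API or the benchmark's actual settings.

```python
def pq_layout(vector_dim: int, pq_factor: int, bits_per_code: int = 8):
    """Return (num_subspaces, encoded_bytes) for one vector.

    Assumes the 'pq = dimensions per subspace' reading, i.e.
    vector_dim / pq_factor == num_subspaces. Hypothetical helper.
    """
    if vector_dim % pq_factor != 0:
        raise ValueError("vector_dim must be divisible by pq_factor")
    num_subspaces = vector_dim // pq_factor
    # One code of bits_per_code bits is stored per subspace.
    encoded_bytes = num_subspaces * bits_per_code // 8
    return num_subspaces, encoded_bytes


# A larger factor means fewer subspaces, so each code has to summarise
# more dimensions and the approximation gets coarser.
for factor in (2, 4, 8, 16):
    print(factor, pq_layout(768, factor))
```

Under this reading, a larger `pq=` leaves fewer subspaces per vector, which
is why recall would be expected to drop as the value grows.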

Regardless, is there any oversampling that is occurring when PQ is enabled
in JVector?
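
For readers unfamiliar with the term, oversampling here means gathering more
than `k` candidates using the quantized (approximate) distances and then
reranking them with the raw vectors. The following is a rough sketch of that
idea using a brute-force scan rather than a graph; the function names and the
`oversample` parameter are hypothetical, and this is not how JVector or
Lucene implement it.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_with_oversampling(query, raw_vectors, approx_vectors, k, oversample=3):
    """Return the ids of the k nearest docs after an approximate first pass.

    approx_vectors stands in for the decoded PQ vectors (an assumption
    made to keep the sketch short).
    """
    # Stage 1: rank by the cheap approximate distance, keep k * oversample.
    n_candidates = min(len(approx_vectors), k * oversample)
    candidates = heapq.nsmallest(
        n_candidates,
        range(len(approx_vectors)),
        key=lambda i: squared_l2(query, approx_vectors[i]),
    )
    # Stage 2: rerank only those candidates with exact distances.
    return heapq.nsmallest(
        k, candidates, key=lambda i: squared_l2(query, raw_vectors[i])
    )
```

With `oversample=1` the result is whatever the approximate distances say; a
larger value gives the exact rerank a chance to recover neighbours that the
quantization error pushed down the list.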

> It's great that PQ goes quite a ways before hurting recall.

PQ is a sharp tool; at scale it can have significant drawbacks (eventually
you have to start dramatically oversampling as centroids get very noisy).
Though, I am not sure there is a significant recall cliff.

Two significant issues with a Lucene implementation I can think of are:

- Segment merge time: We can maybe do some fancy things with better starter
centroids in Lloyd's algorithm, but I think we will have to rerun Lloyd's
algorithm on every merge. Additionally the graph building probably cannot be
done with the PQ'd vectors.
- Graph quality: I don't think we can build the graph with PQ'd vectors and
retain good recall. Meaning at merge time, we have to page in larger raw (or
differently quantized) vectors and only do PQ after graph creation.
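
As a rough illustration of the "better starter centroids" idea, the sketch
below reruns Lloyd's algorithm over the merged points but seeds it with the
centroids of the largest incoming segment instead of a random
initialisation. It handles a single codebook (in PQ this would be repeated
per subspace), and the names `lloyd` and `merge_codebook` as well as the
segment representation are assumptions for illustration, not Lucene code.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, init_centroids, max_iters=25):
    """Plain Lloyd's algorithm (k-means) starting from the given centroids."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda j: squared_l2(p, centroids[j])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(len(old))]
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids


def merge_codebook(segments):
    """segments: list of (points, centroids) pairs, one per merging segment.

    Warm-starts from the largest segment's centroids (a hypothetical policy).
    """
    largest = max(segments, key=lambda seg: len(seg[0]))
    all_points = [p for points, _ in segments for p in points]
    return lloyd(all_points, largest[1])
```

A warm start like this would only reduce the number of iterations needed; the
merged raw (or differently quantized) vectors still have to be read in full,
which is the cost the two points above are concerned with.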

There are [interesting approaches to PQ w/ graph
exploration](https://medium.com/@masajiro.iwasaki/fusion-of-graph-based-indexing-and-product-quantization-for-ann-search-7d1f0336d0d0),
and a different PQ implementation from Microsoft that is worth a look:
[OPQ](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/pami13opq.pdf).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
