kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808717985

   @benwtrent 
   > Thank you @kevindrosendahl this does seem to confirm my suspicion that the 
improvement isn't necessarily due to the data structure, but due to 
quantization. But, this does confuse me as Vamana is supposedly good when 
things don't all fit in memory. It may be due to how we fetch things (MMAP). I 
wonder if Vamana is any good at all when using MMAP...
   
   Yeah, upon reflection the result makes sense. DiskANN only uses the in-memory PQ vectors for the distance calculations that decide which new candidates it should consider visiting in the future; that is the critical piece that reduces I/O. Storing the vectors in the graph means that once you have decided on the next candidate, you get its full fidelity vector for free on the same I/O that fetches its adjacency list, but at that point you've already done the distance calculation against that candidate. If you were to grab the full fidelity vector from the graph for the distance calculation, like the non-PQ Vamana implementations do, you'd move the I/O complexity back from `O(candidates_explored)` to `O(nodes_considered)`, and you'd probably need to fetch the page again when you actually decide to explore a candidate in order to re-grab its adjacency list.
   
   Where the full fidelity vector is useful is re-scoring: you can keep it in memory until the end of the search ([jvector reference](https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L218-L220)) and re-sort the final result list by true distance ([jvector reference](https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L258C6-L259)).
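   To make the I/O pattern concrete, here is a minimal sketch of that style of search loop. It is not jvector's actual code; `PQCodes`, `GraphOnDisk`, and `NodeOnDisk` are hypothetical stand-ins for "PQ codes held in memory" and "graph file with full fidelity vectors stored inline":
   
   ```java
   import java.util.*;
   
   // Sketch only: PQCodes, GraphOnDisk, and NodeOnDisk are hypothetical stand-ins,
   // not jvector/Lucene APIs.
   interface PQCodes { float approxDistance(float[] query, int nodeId); }
   interface GraphOnDisk { NodeOnDisk read(int nodeId); }   // one disk read per call
   record NodeOnDisk(float[] vector, int[] neighbors) {}
   
   class DiskAnnSearchSketch {
     static int[] search(float[] query, int entryPoint, int beamWidth, int k,
                         PQCodes pq, GraphOnDisk graph) {
       record Candidate(int id, float approxDist) {}
       PriorityQueue<Candidate> frontier =
           new PriorityQueue<>(Comparator.comparingDouble(Candidate::approxDist));
       Set<Integer> visited = new HashSet<>();
       Map<Integer, float[]> exact = new HashMap<>();  // full fidelity vectors picked up "for free"
       List<Candidate> explored = new ArrayList<>();
   
       frontier.add(new Candidate(entryPoint, pq.approxDistance(query, entryPoint)));
       visited.add(entryPoint);
       while (!frontier.isEmpty() && explored.size() < beamWidth) {
         Candidate next = frontier.poll();        // chosen by PQ distance alone: no I/O
         NodeOnDisk node = graph.read(next.id()); // one I/O: adjacency list + exact vector together
         exact.put(next.id(), node.vector());     // keep the exact vector for re-ranking later
         explored.add(next);
         for (int neighbor : node.neighbors()) {
           if (visited.add(neighbor)) {
             // Scoring a *potential* candidate only touches the in-memory PQ codes,
             // so I/O scales with candidates explored, not nodes considered.
             frontier.add(new Candidate(neighbor, pq.approxDistance(query, neighbor)));
           }
         }
       }
       // Re-rank: re-sort the explored candidates by exact distance using the vectors
       // cached from the adjacency-list reads, then return the top k.
       explored.sort(Comparator.comparingDouble(c -> l2(query, exact.get(c.id()))));
       return explored.stream().limit(k).mapToInt(Candidate::id).toArray();
     }
   
     static float l2(float[] a, float[] b) {
       float sum = 0;
       for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; sum += d * d; }
       return sum;
     }
   }
   ```
   
   The point to notice is that `pq.approxDistance` is called for every node considered, while `graph.read` (the only disk access) happens once per node actually explored, and the exact vector it returns is reused for the final re-sort.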
   
   I'm not sure how impactful this re-ranking is (I'm hoping to A/B it once I get PQ working), but assuming it matters, the index structure could save up to `numCandidates` I/Os. In this case jvector was only doing ~70 I/Os, so adding up to 100 more on top of that could be a pretty big deal.
   
   The main thing MMAP may prevent us from "easily" doing is parallelizing the I/O, which I believe the DiskANN paper does but jvector currently does not (possibly because the results look pretty decent without it, and because it's hard to do with mmap in Java).
   
   > Or is `pq=` not the number of subspaces, but `vectorDim/pq == 
numberOfSubSpaces`? If so, then recall should reduce as it increases.
   
   This is correct: `pq` in the tables corresponds to `pqFactor` in JVector, which works as you've described. Agreed with you and @jbellis that the observed result is bizarre.
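   To spell out the relationship (the dimensionality below is just an illustrative number, not the benchmark's):
   
   ```java
   int vectorDim = 1536;                      // example dimensionality only
   int pqFactor = 8;                          // jvector-style compression factor
   int numSubspaces = vectorDim / pqFactor;   // 192 subspaces
   int bytesPerCode = numSubspaces;           // 1 byte per subspace, assuming 256-centroid codebooks
   // Larger pqFactor -> fewer subspaces -> smaller codes, which is why recall
   // would normally be expected to drop as pqFactor grows.
   ```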
   
   > For your testing infra, int8 quantization might close the gap at that 
scale.
   
   I ran some tests with int8 quantization, and nothing in the results was very surprising. If you can get the full index to fit in memory it performs great, but performance degrades significantly once it falls out of memory, roughly in proportion to the I/O required. The scalar quantized vectors were 7.3 GB, with the HNSW graph at 329 MB. For reference, jvector with `pqFactor=8` was a 32.45 GB graph (full fidelity vectors inline) plus 0.97 GB of quantized vectors.
   | index configuration | memory configuration | average recall | average latency | average page faults |
   |-|-|-|-|-|
   | hnsw scalar quant | fully in memory | 0.79819 | 0.001259s | 0 |
   | jvector `pqFactor=8` | fully in memory | 0.9121 | 0.00148s | 0 |
   | hnsw scalar quant | 10GB system 2GB heap | 0.79819 | 0.00119s | 0 |
   | hnsw scalar quant | 8GB system 4GB heap | 0.79819 | 0.856s | 606.98 |
   | jvector `pqFactor=8` | 4GB system 2GB heap | 0.9121 | 0.151s | 80.15 |
   
   So fundamentally, with these graph structures the key is to have the vectors needed for distance calculations against potential neighbor candidates in memory. Scalar quantization helps the cause by shrinking the vectors by a factor of 4; PQ can help even more by reducing the memory footprint far more drastically. JVector further ensures the quantized vectors stay in memory by storing them on the heap.
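   As a rough consistency check on those numbers (back-of-the-envelope only, ignoring per-format overhead like graph links, headers, and codebooks):
   
   ```java
   // Back-of-the-envelope sizing using the figures above.
   public class SizeCheck {
     public static void main(String[] args) {
       double int8Gb = 7.3;             // measured: scalar-quantized vectors, 1 byte per dimension
       double float32Gb = int8Gb * 4;   // ~29 GB if the same vectors were raw float32
       double pqGb = int8Gb / 8;        // pqFactor=8 => dim/8 one-byte codes, ~0.9 GB
       System.out.printf("float32 ~%.0f GB, int8 %.1f GB, pq ~%.2f GB%n", float32Gb, int8Gb, pqGb);
     }
   }
   ```
   
   The ~0.9 GB estimate for the PQ codes lines up with the 0.97 GB measured, which is why keeping them on the heap stays cheap even when the 32 GB graph itself cannot fit in memory.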
   
   > FYI, as significant (and needed) refactor occurred recently for int8 
quantization & HNSW, so your testing branch might be significantly impacted :(.
   
   Is there any expected performance difference, or is it mainly organizational? The code in my branch is not in a state I would be comfortable proposing for upstreaming, so if the refactor is "just" a problem for that, I'm OK keeping my branch a bit outdated, and we could adapt any ideas that seem suitable to the new structure if/when upstreaming looks worthwhile.
   
   
   > Two significant issues with a Lucene implementation I can think of are:
   > 
   > * Segment merge time: We can maybe do some fancy things with better 
starter centroids in Lloyd's algorithm, but I think we will have to rerun 
Lloyd's algorithm on every merge. Additionally the graph building probably 
cannot be done with the PQ'd vectors.
   > * Graph quality: I don't think we can build the graph with PQ'd vectors 
and retain good recall. Meaning at merge time, we have to page in larger raw 
(or differently quantized) vectors and only do PQ after graph creation.
   > 
   > There are [interesting approaches to PQ w/ graph 
exploration](https://medium.com/@masajiro.iwasaki/fusion-of-graph-based-indexing-and-product-quantization-for-ann-search-7d1f0336d0d0)
 and a different PQ implementation via Microsoft that is worthwhile 
[OPQ](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/pami13opq.pdf)
   
   Yeah, it seems like for larger-than-memory indexes, some form of clustering or quantization is going to be a must-have. I do really want to try out SPANN as well. It seems analogous to Lucene's existing FST-as-an-index-to-a-postings-list design, and my hunch is that there may be more tricks you can play at merge time to reduce how often things need to be re-clustered, and potentially the computational complexity. It's also just a lot simpler conceptually.
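   For a rough sense of the merge-time cost raised above, here is a generic sketch of what (re)training PQ codebooks involves; it is not any particular library's implementation, just plain Lloyd's iterations run independently per subspace:
   
   ```java
   import java.util.*;
   
   // Generic sketch of PQ codebook training: split each vector into subspaces and
   // run Lloyd's (k-means) iterations per subspace. This is the work that would
   // need to be redone, or cleverly seeded, on every segment merge.
   class PqTrainSketch {
     /** vectors: n x dim, subDim: dims per subspace, k: centroids per subspace (e.g. 256). */
     static float[][][] train(float[][] vectors, int subDim, int k, int iters, Random rnd) {
       int numSubspaces = vectors[0].length / subDim;
       float[][][] codebooks = new float[numSubspaces][k][subDim];
       for (int s = 0; s < numSubspaces; s++) {
         for (int c = 0; c < k; c++) {                 // seed centroids from random training vectors
           float[] v = vectors[rnd.nextInt(vectors.length)];
           codebooks[s][c] = Arrays.copyOfRange(v, s * subDim, (s + 1) * subDim);
         }
         for (int it = 0; it < iters; it++) {          // Lloyd's iterations
           float[][] sums = new float[k][subDim];
           int[] counts = new int[k];
           for (float[] v : vectors) {                 // assign each sub-vector to its nearest centroid
             float[] sub = Arrays.copyOfRange(v, s * subDim, (s + 1) * subDim);
             int best = nearest(codebooks[s], sub);
             counts[best]++;
             for (int d = 0; d < subDim; d++) sums[best][d] += sub[d];
           }
           for (int c = 0; c < k; c++) {               // recompute centroids
             if (counts[c] > 0) {
               for (int d = 0; d < subDim; d++) codebooks[s][c][d] = sums[c][d] / counts[c];
             }
           }
         }
       }
       return codebooks;
     }
   
     static int nearest(float[][] centroids, float[] sub) {
       int best = 0;
       float bestDist = Float.MAX_VALUE;
       for (int c = 0; c < centroids.length; c++) {
         float dist = 0;
         for (int d = 0; d < sub.length; d++) { float diff = sub[d] - centroids[c][d]; dist += diff * diff; }
         if (dist < bestDist) { bestDist = dist; best = c; }
       }
       return best;
     }
   }
   ```
   
   The work is roughly `iterations x numVectors x dim x centroids` operations per merge, which is why smarter seeding (or reusing codebooks across merges) is attractive.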
   
   As an aside, I'm not sure how much time I'll have to devote to this over the coming week, but I'm hoping to keep making forward progress.
   
   If anyone has other datasets in the ~30-50 GB range that could be useful for testing, that would be helpful too.
   

