kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808717985

   @benwtrent 
   > Thank you @kevindrosendahl this does seem to confirm my suspicion that the 
improvement isn't necessarily due to the data structure, but due to 
quantization. But, this does confuse me as Vamana is supposedly good when 
things don't all fit in memory. It may be due to how we fetch things (MMAP). I 
wonder if Vamana is any good at all when using MMAP...
   
   Yeah, upon reflection the result makes sense. DiskANN only uses the in-memory PQ vectors for the distance calculations that decide which new candidates it should consider visiting in the future; that is the critical piece that reduces I/O. Storing the vectors in the graph means that once you have decided on the next candidate, you get its full fidelity vector for free on the same I/O that fetches its adjacency list, but at that point you've already done the distance calculation against that candidate. If you were to grab the full fidelity vector from the graph for the distance calculation, like the non-PQ Vamana implementations do, you'd move the I/O complexity back from `O(candidates_explored)` to `O(nodes_considered)`, and you'd probably need to fetch the page again when you actually decide to explore a candidate in order to re-grab its adjacency list.
   
   Where the full fidelity vector is useful is re-scoring: you can keep it in memory until the end of the search ([jvector reference](https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L218-L220)) and re-sort the final result list by true distance ([jvector reference](https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L258C6-L259)).
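   To make the I/O pattern concrete, here is a minimal sketch of that style of search loop. It is not jvector's actual code; `PQCodes`, `GraphOnDisk`, and `NodeOnDisk` are hypothetical stand-ins for "PQ codes held in memory" and "graph file with full fidelity vectors stored inline":
   
   ```java
   import java.util.*;
   
   // Sketch only: PQCodes, GraphOnDisk, and NodeOnDisk are hypothetical stand-ins,
   // not jvector/Lucene APIs.
   interface PQCodes { float approxDistance(float[] query, int nodeId); }
   interface GraphOnDisk { NodeOnDisk read(int nodeId); }   // one disk read per call
   record NodeOnDisk(float[] vector, int[] neighbors) {}
   
   class DiskAnnSearchSketch {
     static int[] search(float[] query, int entryPoint, int beamWidth, int k,
                         PQCodes pq, GraphOnDisk graph) {
       record Candidate(int id, float approxDist) {}
       PriorityQueue<Candidate> frontier =
           new PriorityQueue<>(Comparator.comparingDouble(Candidate::approxDist));
       Set<Integer> visited = new HashSet<>();
       Map<Integer, float[]> exact = new HashMap<>();  // full fidelity vectors picked up "for free"
       List<Candidate> explored = new ArrayList<>();
   
       frontier.add(new Candidate(entryPoint, pq.approxDistance(query, entryPoint)));
       visited.add(entryPoint);
       while (!frontier.isEmpty() && explored.size() < beamWidth) {
         Candidate next = frontier.poll();        // chosen by PQ distance alone: no I/O
         NodeOnDisk node = graph.read(next.id()); // one I/O: adjacency list + exact vector together
         exact.put(next.id(), node.vector());     // keep the exact vector for re-ranking later
         explored.add(next);
         for (int neighbor : node.neighbors()) {
           if (visited.add(neighbor)) {
             // Scoring a *potential* candidate only touches the in-memory PQ codes,
             // so I/O scales with candidates explored, not nodes considered.
             frontier.add(new Candidate(neighbor, pq.approxDistance(query, neighbor)));
           }
         }
       }
       // Re-rank: re-sort the explored candidates by exact distance using the vectors
       // cached from the adjacency-list reads, then return the top k.
       explored.sort(Comparator.comparingDouble(c -> l2(query, exact.get(c.id()))));
       return explored.stream().limit(k).mapToInt(Candidate::id).toArray();
     }
   
     static float l2(float[] a, float[] b) {
       float sum = 0;
       for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; sum += d * d; }
       return sum;
     }
   }
   ```
   
   The point to notice is that `pq.approxDistance` is called for every node considered, while `graph.read` (the only disk access) happens once per node actually explored, and the exact vector it returns is reused for the final re-sort.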
   
   I'm not sure how impactful this re-ranking is (I'm hoping to A/B it once I get PQ working), but assuming it matters, the index structure could save up to `numCandidates` I/Os. In this case jvector was only doing ~70 I/Os, so adding up to 100 more on top of that could be a pretty big deal.
   
   The main thing MMAP may prevent us from "easily" doing is parallelizing the I/O, which I believe the DiskANN paper does but jvector currently does not (possibly because the results look pretty decent without it, and because it's hard to do with mmap in Java).
   
   > Or is `pq=` not the number of subspaces, but `vectorDim/pq == 
numberOfSubSpaces`? If so, then recall should reduce as it increases.
   
   This is correct: `pq` in the tables corresponds to `pqFactor` in JVector, which works as you've described. Agreed with you and @jbellis that the observed result is bizarre.
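   To spell out the relationship (the dimensionality below is just an illustrative number, not the benchmark's):
   
   ```java
   int vectorDim = 1536;                      // example dimensionality only
   int pqFactor = 8;                          // jvector-style compression factor
   int numSubspaces = vectorDim / pqFactor;   // 192 subspaces
   int bytesPerCode = numSubspaces;           // 1 byte per subspace, assuming 256-centroid codebooks
   // Larger pqFactor -> fewer subspaces -> smaller codes, which is why recall
   // would normally be expected to drop as pqFactor grows.
   ```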
   
   > For your testing infra, int8 quantization might close the gap at that 
scale.
   
   I ran some tests with int8 quantization, and nothing in the results was very surprising. If you can get the full index to fit in memory it performs great, but performance degrades significantly once it falls out of memory, roughly in proportion to the I/O required. The scalar quantized vectors were 7.3 GB, with the HNSW graph at 329 MB. For reference, jvector with `pqFactor=8` was a 32.45 GB graph (full fidelity vectors inline) plus 0.97 GB of quantized vectors.
   | index configuration | memory configuration | average recall | average latency | average page faults |
   |-|-|-|-|-|
   | hnsw scalar quant | fully in memory | 0.79819 | 0.001259s | 0 |
   | jvector `pqFactor=8` | fully in memory | 0.9121 | 0.00148s | 0 |
   | hnsw scalar quant | 10GB system 2GB heap | 0.79819 | 0.00119s | 0 |
   | hnsw scalar quant | 8GB system 4GB heap | 0.79819 | 0.856s | 606.98 |
   | jvector `pqFactor=8` | 4GB system 2GB heap | 0.9121 | 0.151s | 80.15 |
   
   So fundamentally, with these graph structures the key is to have the vectors needed for distance calculations against potential neighbor candidates in memory. Scalar quantization helps the cause by shrinking the vectors by a factor of 4; PQ can help even more by reducing the memory footprint far more drastically. JVector further ensures the quantized vectors stay in memory by storing them on the heap.
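   As a rough consistency check on those numbers (back-of-the-envelope only, ignoring per-format overhead like graph links, headers, and codebooks):
   
   ```java
   // Back-of-the-envelope sizing using the figures above.
   public class SizeCheck {
     public static void main(String[] args) {
       double int8Gb = 7.3;             // measured: scalar-quantized vectors, 1 byte per dimension
       double float32Gb = int8Gb * 4;   // ~29 GB if the same vectors were raw float32
       double pqGb = int8Gb / 8;        // pqFactor=8 => dim/8 one-byte codes, ~0.9 GB
       System.out.printf("float32 ~%.0f GB, int8 %.1f GB, pq ~%.2f GB%n", float32Gb, int8Gb, pqGb);
     }
   }
   ```
   
   The ~0.9 GB estimate for the PQ codes lines up with the 0.97 GB measured, which is why keeping them on the heap stays cheap even when the 32 GB graph itself cannot fit in memory.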
   
   > FYI, as significant (and needed) refactor occurred recently for int8 
quantization & HNSW, so your testing branch might be significantly impacted :(.
   
   Is there any expected performance difference, or is it mainly organizational? The code in my branch is not in a state I would be comfortable proposing for upstreaming, so if the refactor is "just" a problem for that, I'm OK keeping my branch a bit outdated, and we could adapt any ideas that seem suitable to the new structure if/when upstreaming looks worthwhile.
   
   
   > Two significant issues with a Lucene implementation I can think of are:
   > 
   > * Segment merge time: We can maybe do some fancy things with better 
starter centroids in Lloyd's algorithm, but I think we will have to rerun 
Lloyd's algorithm on every merge. Additionally the graph building probably 
cannot be done with the PQ'd vectors.
   > * Graph quality: I don't think we can build the graph with PQ'd vectors 
and retain good recall. Meaning at merge time, we have to page in larger raw 
(or differently quantized) vectors and only do PQ after graph creation.
   > 
   > There are [interesting approaches to PQ w/ graph 
exploration](https://medium.com/@masajiro.iwasaki/fusion-of-graph-based-indexing-and-product-quantization-for-ann-search-7d1f0336d0d0)
 and a different PQ implementation via Microsoft that is worthwhile 
[OPQ](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/pami13opq.pdf)
   
   Yeah, it seems like for larger-than-memory indexes, some form of clustering or quantization is going to be a must-have. I do really want to try out SPANN as well. It seems analogous to Lucene's existing FST-as-an-index-to-a-postings-list design, and my hunch is that there may be more tricks you can play at merge time to reduce how often things need to be re-clustered, and potentially the computational complexity. It's also just a lot simpler conceptually.
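   For a rough sense of the merge-time cost raised above, here is a generic sketch of what (re)training PQ codebooks involves; it is not any particular library's implementation, just plain Lloyd's iterations run independently per subspace:
   
   ```java
   import java.util.*;
   
   // Generic sketch of PQ codebook training: split each vector into subspaces and
   // run Lloyd's (k-means) iterations per subspace. This is the work that would
   // need to be redone, or cleverly seeded, on every segment merge.
   class PqTrainSketch {
     /** vectors: n x dim, subDim: dims per subspace, k: centroids per subspace (e.g. 256). */
     static float[][][] train(float[][] vectors, int subDim, int k, int iters, Random rnd) {
       int numSubspaces = vectors[0].length / subDim;
       float[][][] codebooks = new float[numSubspaces][k][subDim];
       for (int s = 0; s < numSubspaces; s++) {
         for (int c = 0; c < k; c++) {                 // seed centroids from random training vectors
           float[] v = vectors[rnd.nextInt(vectors.length)];
           codebooks[s][c] = Arrays.copyOfRange(v, s * subDim, (s + 1) * subDim);
         }
         for (int it = 0; it < iters; it++) {          // Lloyd's iterations
           float[][] sums = new float[k][subDim];
           int[] counts = new int[k];
           for (float[] v : vectors) {                 // assign each sub-vector to its nearest centroid
             float[] sub = Arrays.copyOfRange(v, s * subDim, (s + 1) * subDim);
             int best = nearest(codebooks[s], sub);
             counts[best]++;
             for (int d = 0; d < subDim; d++) sums[best][d] += sub[d];
           }
           for (int c = 0; c < k; c++) {               // recompute centroids
             if (counts[c] > 0) {
               for (int d = 0; d < subDim; d++) codebooks[s][c][d] = sums[c][d] / counts[c];
             }
           }
         }
       }
       return codebooks;
     }
   
     static int nearest(float[][] centroids, float[] sub) {
       int best = 0;
       float bestDist = Float.MAX_VALUE;
       for (int c = 0; c < centroids.length; c++) {
         float dist = 0;
         for (int d = 0; d < sub.length; d++) { float diff = sub[d] - centroids[c][d]; dist += diff * diff; }
         if (dist < bestDist) { bestDist = dist; best = c; }
       }
       return best;
     }
   }
   ```
   
   The work is roughly `iterations x numVectors x dim x centroids` operations per merge, which is why smarter seeding (or reusing codebooks across merges) is attractive.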
   
   As an aside, I'm not sure how much time I'll have to devote to this over the coming week, but I'm hoping to keep making forward progress.
   
   If anyone has other datasets in the ~30-50 GB range that could be useful for testing, that would be helpful too.
   

