kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1808717985
@benwtrent

> Thank you @kevindrosendahl this does seem to confirm my suspicion that the improvement isn't necessarily due to the data structure, but due to quantization. But, this does confuse me as Vamana is supposedly good when things don't all fit in memory. It may be due to how we fetch things (MMAP). I wonder if Vamana is any good at all when using MMAP...

Yeah, upon reflection the result makes sense. DiskANN only uses the in-memory PQ vectors for the distance calculations that decide which new candidates to consider visiting next; this is the critical piece that reduces I/O. Storing the full-fidelity vectors inline in the graph means that once you have committed to a candidate, you get its full-fidelity vector for free on the same I/O that fetches its adjacency list, but at that point you've already done the distance calculation against that candidate. If you instead grabbed the full-fidelity vector from the graph for every distance calculation, like the non-PQ Vamana implementations do, you'd move the I/O complexity back from `O(candidates_explored)` to `O(nodes_considered)`, and you'd probably need to fetch the page a second time once you decided to actually explore a candidate, to re-read its adjacency list.

What the full-fidelity vector is useful for is re-scoring: you can keep it in memory until the end of the search ([jvector reference](https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L218-L220)) and re-sort the final result list by true distance ([jvector reference](https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L258C6-L259)). I'm not sure how impactful this re-ranking is (I'm hoping to A/B it once I get PQ working), but assuming it matters, the index structure could save up to `numCandidates` I/Os. In this case jvector was only doing ~70 I/Os, so adding up to 100 more on top of that could be a pretty big deal.
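To make that loop shape concrete, here's a minimal sketch of the search I'm describing. This is my own illustration, not jvector's or Lucene's actual API; `PqCodes`, `GraphOnDisk`, and `NodePage` are hypothetical stand-ins for the real storage layer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

public final class DiskAnnSearchSketch {

  interface PqCodes {
    // Approximate distance computed from in-memory PQ codes: zero disk I/O.
    float approxDistance(float[] query, int node);
  }

  interface GraphOnDisk {
    // One disk read returns a node's adjacency list AND its full-fidelity
    // vector, since they are stored together in the graph.
    NodePage read(int node);
  }

  record NodePage(int[] neighbors, float[] fullVector) {}
  record Candidate(int node, float approxDist) {}

  static List<Integer> search(float[] query, int entry, int topK,
                              int beamWidth, GraphOnDisk graph, PqCodes pq) {
    Comparator<Candidate> byDist = Comparator.comparingDouble(Candidate::approxDist);
    PriorityQueue<Candidate> frontier = new PriorityQueue<>(byDist);        // best first
    PriorityQueue<Candidate> beam = new PriorityQueue<>(byDist.reversed()); // worst first
    Set<Integer> visited = new HashSet<>();
    Map<Integer, float[]> cachedVectors = new HashMap<>(); // kept for the re-rank

    frontier.add(new Candidate(entry, pq.approxDistance(query, entry)));
    visited.add(entry);

    while (!frontier.isEmpty()) {
      Candidate cur = frontier.poll();
      if (beam.size() >= beamWidth && cur.approxDist() > beam.peek().approxDist()) {
        break; // no remaining candidate can improve the beam
      }
      // The ONLY I/O in the loop: expanding a candidate we committed to.
      // Its full-fidelity vector rides along for free on the same read.
      NodePage page = graph.read(cur.node());
      cachedVectors.put(cur.node(), page.fullVector());
      beam.add(cur);
      if (beam.size() > beamWidth) beam.poll();

      for (int nbr : page.neighbors()) {
        if (visited.add(nbr)) {
          // Neighbors are scored with in-memory PQ codes only, so I/O stays
          // O(candidates_explored) rather than O(nodes_considered).
          frontier.add(new Candidate(nbr, pq.approxDistance(query, nbr)));
        }
      }
    }

    // Re-rank the beam with exact distances using the vectors cached during
    // expansion -- no additional I/O is needed at this point.
    List<Candidate> results = new ArrayList<>(beam);
    results.sort(Comparator.comparingDouble(
        c -> exactDistance(query, cachedVectors.get(c.node()))));
    List<Integer> top = new ArrayList<>();
    for (int i = 0; i < Math.min(topK, results.size()); i++) {
      top.add(results.get(i).node());
    }
    return top;
  }

  private static float exactDistance(float[] a, float[] b) { // squared L2
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

If the re-rank does matter, that `cachedVectors` map is exactly where the inline-vector layout saves the up-to-`numCandidates` extra I/Os you'd otherwise spend fetching raw vectors at the end.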
The main thing mmap may prevent us from "easily" doing is parallelizing the I/O, which I believe the DiskANN paper does but jvector currently does not (possibly because the results look pretty decent without it, and because it's hard to do with mmap/Java).

> Or is `pq=` not the number of subspaces, but `vectorDim/pq == numberOfSubSpaces`? If so, then recall should reduce as it increases.

This is correct: `pq` in the tables corresponds to `pqFactor` in JVector, which works as you've described (a larger `pqFactor` means fewer subspaces and coarser quantization, so recall should drop as it increases). Agreed with you and @jbellis that the observed result is bizarre.

> For your testing infra, int8 quantization might close the gap at that scale.

I ran some tests with int8 quantization; nothing very surprising in the results. If you can get the full index to fit in memory it performs great, but performance degrades significantly once it falls out of memory, proportionally to the I/O involved. The scalar quant vectors were 7.3 GB, with the HNSW graph itself at 329 MB. For reference, jvector `pqFactor=8` was a 32.45 GB graph with inline vectors plus 0.97 GB of quantized vectors.

| index configuration | memory configuration | average recall | average latency | average page faults |
|-|-|-|-|-|
| hnsw scalar quant | fully in memory | 0.79819 | 0.001259s | 0 |
| jvector `pqFactor=8` | fully in memory | 0.9121 | 0.00148s | 0 |
| hnsw scalar quant | 10GB system 2GB heap | 0.79819 | 0.00119s | 0 |
| hnsw scalar quant | 8GB system 4GB heap | 0.79819 | 0.856s | 606.98 |
| jvector `pqFactor=8` | 4GB system 2GB heap | 0.9121 | 0.151s | 80.15 |

So fundamentally, with these graph structures the key is keeping the vectors needed for distance calculations against potential neighbor candidates in memory. Scalar quantization helps by shrinking the vectors 4x (one byte per dimension instead of a 4-byte float). PQ can help even more: assuming the typical one code byte per subspace, `pqFactor=8` shrinks the quantized vectors ~32x, which is how roughly 31 GB of raw vectors compresses down to the 0.97 GB of PQ codes above. JVector further ensures the quantized vectors stay in memory by storing them on the heap.

> FYI, a significant (and needed) refactor occurred recently for int8 quantization & HNSW, so your testing branch might be significantly impacted :(.

Is there any expected performance difference, or is it mainly organizational? The code in my branch is not in a state I'd be comfortable proposing for upstreaming, so if it's "just" a problem for that, I'm OK keeping my branch a bit outdated, and we can fold any suitable ideas in if/when upstreaming looks worthwhile.

> Two significant issues with a Lucene implementation I can think of are:
>
> * Segment merge time: We can maybe do some fancy things with better starter centroids in Lloyd's algorithm, but I think we will have to rerun Lloyd's algorithm on every merge. Additionally the graph building probably cannot be done with the PQ'd vectors.
> * Graph quality: I don't think we can build the graph with PQ'd vectors and retain good recall. Meaning at merge time, we have to page in larger raw (or differently quantized) vectors and only do PQ after graph creation.
>
> There are [interesting approaches to PQ w/ graph exploration](https://medium.com/@masajiro.iwasaki/fusion-of-graph-based-indexing-and-product-quantization-for-ann-search-7d1f0336d0d0) and a different PQ implementation via Microsoft that is worthwhile: [OPQ](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/pami13opq.pdf)

Yeah, it seems like for larger-than-memory indexes, some form of clustering or quantization is going to be a must-have. I do really want to try out SPANN as well (rough sketch at the end of this comment). It seems analogous to Lucene's existing FST-as-an-index-to-a-postings-list design, and my hunch is that there may be more tricks you can play at merge time to reduce how often you need to re-cluster things, and potentially the computational complexity as well. It's also just a lot simpler conceptually.

As an aside, I'm not sure I'll have much time to devote to this in the coming week, but I'm hoping to keep making forward progress. If anyone has other datasets in the ~30-50 GB range that could be useful for testing, that would be helpful too.
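P.S. To make the SPANN / FST-to-postings analogy above a bit more concrete, here's a rough sketch of the read path under my assumptions. The types are hypothetical, in-process lists stand in for on-disk posting files, and I've omitted SPANN's replication of boundary vectors into multiple nearby posting lists. The point is that the in-memory structure is just the centroids, so a query costs at most `nProbe` posting-list reads, analogous to an FST lookup followed by a postings scan.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public final class SpannSketch {

  record Posting(int docId, float[] vector) {}
  record Scored(int id, float dist) {}

  private final float[][] centroids;              // small enough to keep in memory
  private final List<List<Posting>> postingLists; // stands in for on-disk posting files

  SpannSketch(float[][] centroids, List<List<Posting>> postingLists) {
    this.centroids = centroids;
    this.postingLists = postingLists;
  }

  List<Integer> search(float[] query, int nProbe, int topK) {
    // 1. In-memory only: rank all centroids against the query.
    List<Scored> nearestCentroids = new ArrayList<>();
    for (int c = 0; c < centroids.length; c++) {
      nearestCentroids.add(new Scored(c, l2(query, centroids[c])));
    }
    nearestCentroids.sort(Comparator.comparingDouble(Scored::dist));

    // 2. I/O happens here, bounded by nProbe posting-list reads rather
    //    than by whatever order a graph traversal happens to take.
    List<Scored> hits = new ArrayList<>();
    for (int i = 0; i < Math.min(nProbe, nearestCentroids.size()); i++) {
      for (Posting p : postingLists.get(nearestCentroids.get(i).id())) {
        hits.add(new Scored(p.docId(), l2(query, p.vector())));
      }
    }
    hits.sort(Comparator.comparingDouble(Scored::dist));

    List<Integer> top = new ArrayList<>();
    for (int i = 0; i < Math.min(topK, hits.size()); i++) {
      top.add(hits.get(i).id());
    }
    return top;
  }

  private static float l2(float[] a, float[] b) { // squared L2 distance
    float s = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }
}
```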