kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1868095892
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1868095892

Hi all, thanks for the patience; some interesting and exciting results to share.

TL;DR:
- DiskANN doesn't seem to lose any performance relative to HNSW when fully in memory, and may actually be faster
- able to slightly beat JVector's larger-than-memory performance in Lucene using the same algorithm
- able to get similar performance (sometimes better, sometimes worse, depending on the amount of available memory) when _not_ storing the full-fidelity vectors in the graph, and doing the reranking without having cached the vectors during graph traversal
- able to further improve performance ~7-10x by parallelizing the I/O during the reranking phase using [io_uring](https://en.wikipedia.org/wiki/Io_uring) when storing the vectors outside of the graph
- with parallelized reranking and the (relatively small) graph and PQ vectors pinned in memory, latency against a 100 GB index with memory constrained to only 8 GB is only about 40% higher than fully in-memory Lucene HNSW (and achieves ~10% higher recall given the tested configurations)
- without pinning the graph and PQ vectors, the benefits of parallel I/O during reranking depend on how good the page cache is at keeping the graph and PQ vectors in memory. If they get paged out, the page faults during graph traversal can negate some or most of the gains that parallel I/O offers

### Algorithm Analysis

The DiskANN algorithm consists of two phases (a rough sketch follows this section):
1. graph traversal, using PQ vectors for distance calculations
2. reranking the top *k* after finishing graph traversal, recalculating distances with the full-fidelity vectors instead of the PQ vectors

The original algorithm (and JVector's implementation) stores the full-fidelity vector and the adjacency list for a graph node together on disk. This means that when you retrieve the adjacency list during the first phase, the full-fidelity vector will also be in memory due to locality, so you cache it immediately after reading the adjacency list. When reranking in phase 2, you simply use the vectors cached during the first phase and can rerank without any I/O.

There are two interesting nuances to this algorithm:
- the working set for a given workload is quite large, since it consists of the adjacency lists, PQ vectors, _and full-fidelity vectors_ of all nodes visited during graph traversal
- there are dependencies between the I/Os, since each read is tied to the sequential graph traversal

If you instead do not store the vectors in the graph and do not cache them during traversal, those two points change, which gives some intuition for the performance of the pinned graph/PQ vectors with `io_uring`:
- the working set for the first phase is relatively small: just the adjacency lists and PQ vectors of the nodes that are visited
- there are no ordering dependencies between the I/Os for the second phase, so they can be done in parallel
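To make the two phases concrete, here is a minimal, hedged sketch of the search loop. The class name and the `adjacency`, `pqDistance`, `exactDistance`, and `beamWidth` names are illustrative stand-ins, not Lucene or JVector APIs; in the on-disk implementations, reading `adjacency[node]` in phase 1 and each full-fidelity vector in phase 2 is where the I/O happens.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.IntToDoubleFunction;

/** Illustrative two-phase (traverse + rerank) search; not Lucene/JVector code. */
final class TwoPhaseSearchSketch {

  record Candidate(int node, double score) {}

  private static final Comparator<Candidate> BY_SCORE = Comparator.comparingDouble(Candidate::score);

  /** Phase 1: greedy beam search over the graph, scoring nodes with cheap PQ distances. */
  static List<Candidate> traverse(int entryPoint, int[][] adjacency,
                                  IntToDoubleFunction pqDistance, int beamWidth) {
    PriorityQueue<Candidate> frontier = new PriorityQueue<>(BY_SCORE);         // closest first
    PriorityQueue<Candidate> beam = new PriorityQueue<>(BY_SCORE.reversed());  // worst of the beam first
    Set<Integer> visited = new HashSet<>();

    frontier.add(new Candidate(entryPoint, pqDistance.applyAsDouble(entryPoint)));
    while (!frontier.isEmpty()) {
      Candidate current = frontier.poll();
      // Stop once nothing left in the frontier can improve the beam.
      if (beam.size() >= beamWidth && current.score() >= beam.peek().score()) break;
      if (!visited.add(current.node())) continue;

      beam.add(current);
      if (beam.size() > beamWidth) beam.poll();

      // Reading the adjacency list is the only disk access in this phase
      // (plus the full-fidelity vector, if it is stored alongside and cached here).
      for (int neighbor : adjacency[current.node()]) {
        if (!visited.contains(neighbor)) {
          frontier.add(new Candidate(neighbor, pqDistance.applyAsDouble(neighbor)));
        }
      }
    }
    return new ArrayList<>(beam);
  }

  /** Phase 2: rerank the beam with exact distances computed from the full-fidelity vectors. */
  static List<Candidate> rerank(List<Candidate> beam, IntToDoubleFunction exactDistance, int k) {
    return beam.stream()
        .map(c -> new Candidate(c.node(), exactDistance.applyAsDouble(c.node()))) // one vector read each
        .sorted(BY_SCORE)
        .limit(k)
        .toList();
  }
}
```

In the in-graph variant, `exactDistance` hits vectors already cached during phase 1; in the out-of-graph variant, it is one independent vector read per beam entry.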
### Results

The following results are from indexes built against the [Cohere/wiki-22-12-en-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings) data set (35 million vectors, 768 dimensions) using L2 distance. The indexes are all around ~100 GB. The performance tests were run on the same setup as the prior results (`c7gd.16xlarge`, using Docker to artificially constrain available memory), and all ran 8 query threads (each query thread executing its given query in a single thread).

Recall was calculated using 10k test vectors that were excluded from the data set used to build the index, and performance tests were run using 100k vectors that were included in the data set used to build the index (a large number of test vectors was required to prevent a small working set from being established). Pinning parts of the Lucene Vamana index into memory was done with `mlock`.

Some high-level information about the indexes:

| Type | Parameters | Recall |
|-|-|-|
| Lucene HNSW | maxConn: 16, beamWidth: 100 | 0.78 |
| Lucene Vamana In-Graph Vectors | maxConn: 32, beamWidth: 100, alpha: 1.2, pqFactor: 8 | 0.88 |
| Lucene Vamana Out-of-Graph Vectors | maxConn: 32, beamWidth: 100, alpha: 1.2, pqFactor: 8 | 0.88 |
| JVector | maxConn: 16, beamWidth: 100, alpha: 1.2, pqFactor: 8 | 0.88 |

NB: in Lucene HNSW, `maxConn` configures all levels except level 0, which uses `2*maxConn`; in JVector, the single-layer graph uses `2*maxConn`; and in Lucene Vamana, the single-layer graph uses `maxConn`. Put differently, the bottom layer of each graph uses the same number of connections even though the configurations appear to differ.

#### Fully in Memory

| Type | Configuration | Average Latency | QPS |
|-|-|-|-|
| Lucene HNSW | N/A | 1.59 ms | 5030 |
| Lucene Vamana | vectors: in-graph, rerank: cached | 1.36 ms | 5882 |
| JVector | N/A | 2.35 ms | 3404 |
| Lucene Vamana | vectors: out-of-graph, rerank: sequential | 1.31 ms | 6107 |
| Lucene Vamana | vectors: out-of-graph, rerank: io-uring | 1.65 ms | 4848 |

Charted QPS (left-to-right in the graphs matches top-to-bottom in the above table)

#### 16GB System Memory

| Type | Configuration | Average Latency | QPS |
|-|-|-|-|
| Lucene HNSW | N/A | 205 ms | 39 |
| Lucene Vamana | vectors: in-graph, rerank: cached | 13.4 ms | 597 |
| JVector | N/A | 15.9 ms | 503 |
| Lucene Vamana | vectors: out-of-graph, rerank: sequential | 13.7 ms | 583 |
| Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: false | 4.56 ms | 1754 |
| Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: true | 2.22 ms | 3603 |

Charted QPS (left-to-right in the graphs matches top-to-bottom in the above table; Lucene HNSW is omitted from the graphs)

#### 8GB System Memory

| Type | Configuration | Average Latency | QPS |
|-|-|-|-|
| Lucene HNSW | N/A | 280 ms | 28 |
| Lucene Vamana | vectors: in-graph, rerank: cached | 17.3 ms | 462 |
| JVector | N/A | 19.3 ms | 414 |
| Lucene Vamana | vectors: out-of-graph, rerank: sequential | 24.1 ms | 331 |
| Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: false | 14.6 ms | 547 |
| Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: true | 2.47 ms | 3238 |

Charted QPS (left-to-right in the graphs matches top-to-bottom in the above table; Lucene HNSW is omitted from the graphs)

### Result Analysis

High-level takeaways:
- Lucene HNSW performance suffers greatly when falling out of memory
- DiskANN doesn't seem to lose any of HNSW's performance when fully in memory, and may actually be faster
- the original DiskANN algorithm provides improved performance and is not overly sensitive to the page cache's behavior
- the modified DiskANN algorithm (not storing vectors in the graph) is more sensitive to the page cache
  - when doing sequential phase 2 I/O, performance was better than the original DiskANN algorithm with more memory (16GB) available, but worse when there was less (8GB)
  - when doing parallel phase 2 I/O (sketched below), performance was always better than the original DiskANN algorithm, but the relative improvement and stability of performance was very sensitive to the graph and PQ vectors being in memory; pinning the graph and PQ vectors in memory provided excellent and consistent performance
- using explicit I/O (in this case `io_uring`) is expensive when the page is already in the page cache
  - using `io_uring` while fully in memory is about 20% slower than accessing the vectors through `mmap`
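Because the phase 2 reads have no ordering dependencies, they can all be in flight at once. The benchmarked implementation batches them with `io_uring` via native code; the sketch below only illustrates the shape of the idea using plain positional `FileChannel` reads issued from parallel tasks, with an assumed fixed-size, densely packed vector file. As noted in the "Other Notes" below, spreading the reads across a thread pool like this scaled poorly under concurrent queries, which is why `io_uring` was used for the numbers above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.util.stream.IntStream;

/** Illustrative parallel phase-2 vector reads; the real implementation batches these with io_uring. */
final class ParallelRerankReadsSketch {

  /** Reads the full-fidelity vectors for the given node ids, with all reads issued concurrently. */
  static float[][] readVectors(FileChannel vectorFile, int[] nodes, int dims) {
    final int bytesPerVector = dims * Float.BYTES;
    float[][] vectors = new float[nodes.length][dims];

    // Each read targets an independent offset, so there is no ordering to respect.
    IntStream.range(0, nodes.length).parallel().forEach(i -> {
      ByteBuffer buf = ByteBuffer.allocate(bytesPerVector).order(ByteOrder.LITTLE_ENDIAN); // endianness assumed
      long offset = (long) nodes[i] * bytesPerVector; // assumes fixed-size, densely packed vectors
      try {
        while (buf.hasRemaining()) {
          // Positional reads on FileChannel are safe to issue from multiple threads.
          if (vectorFile.read(buf, offset + buf.position()) < 0) {
            throw new IOException("unexpected EOF reading vector for node " + nodes[i]);
          }
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      buf.flip();
      buf.asFloatBuffer().get(vectors[i]);
    });
    return vectors;
  }
}
```

With `io_uring`, the same batch of offsets is submitted to the kernel together and completed asynchronously, without tying up one blocked thread per outstanding read.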
### Other Notes

- I was able to get similar performance by implementing the parallel I/O with `mmap` and dispatching the accesses across a thread pool when running a single query at a time, but this scaled very poorly with concurrent queries
- I tried implementing concurrent graph traversal as the DiskANN paper suggests (looking up the adjacency lists of the top *L* candidates concurrently), but was unable to get any improvements
- parallel I/O could be promising for SPANN, since there are similarly no ordering dependencies when retrieving vectors from the postings lists
- all implementations *besides* the modified DiskANN with a pinned index and parallel rerank I/O are sensitive to the configured read-ahead. The numbers reported above use `--setra 8`; with `--setra 128` instead, performance looks like this (for 8GB system memory):

| Type | Configuration | Average Latency | QPS |
|-|-|-|-|
| Lucene Vamana | vectors: in-graph, rerank: cached | 51.3 ms | 156 |
| Lucene Vamana | vectors: out-of-graph, rerank: sequential | 46.9 ms | 170 |
| Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: false | 37.9 ms | 211 |
| Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: true | 2.49 ms | 3212 |
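As a point of reference for the `mlocked: true` rows, here is roughly what pinning a mapped index file can look like from Java. This is a minimal sketch, not the benchmarked code: the class and method names are made up, and it assumes a recent JDK with the FFM API (`java.lang.foreign`) to call `mlock(2)` on the mapped region.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch: map a (small) index file and pin it in RAM with mlock(2); illustrative only. */
final class MlockPinSketch {

  private static final MethodHandle MLOCK = Linker.nativeLinker().downcallHandle(
      Linker.nativeLinker().defaultLookup().find("mlock").orElseThrow(),
      // int mlock(const void *addr, size_t len)
      FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));

  /** Maps the file into the global arena and locks its pages into memory. */
  static MemorySegment mapAndPin(Path file) throws Throwable {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      // The global arena keeps the mapping alive for the life of the process.
      MemorySegment segment = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), Arena.global());
      int rc = (int) MLOCK.invokeExact(segment, segment.byteSize());
      if (rc != 0) {
        // Typically means RLIMIT_MEMLOCK is too low for the file size.
        throw new IllegalStateException("mlock failed for " + file);
      }
      return segment;
    }
  }
}
```

Pinning only the graph and PQ vector files is cheap relative to the ~100 GB index, which lines up with the `mlocked: true` configuration staying fast regardless of memory pressure or read-ahead in the tables above.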