kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1868095892

   Hi all, thanks for your patience; I have some interesting and exciting results to share.
   
   TL;DR:
   - DiskANN doesn't seem to lose any performance relative to HNSW when fully 
in memory, and may actually be faster
   - able to slightly beat JVector's larger-than-memory performance in Lucene 
using the same algorithm
   - able to get similar performance (sometimes better, sometimes worse, depending on the amount of available memory) when _not_ storing the full fidelity vectors in the graph and doing the reranking without having cached the vectors during graph traversal
   - able to further improve performance ~7-10x by parallelizing the I/O during 
the reranking phase using [io_uring](https://en.wikipedia.org/wiki/Io_uring) 
when storing the vectors outside of the graph
   - with parallelized reranking and the (relatively small) graph and PQ vectors pinned in memory, latency against a 100 GB index with memory constrained to only 8GB is only about 40% higher than fully in-memory Lucene HNSW (and achieves ~10% higher recall given the tested configurations)
   - without pinning the graph and PQ vectors, the benefits of parallel I/O during reranking depend on how well the page cache keeps the graph and PQ vectors in memory; if they get paged out, the page faults during graph traversal can negate some or most of the gains that parallel I/O offers

   ### Algorithm Analysis
   The DiskANN algorithm consists of two phases:
   1. graph traversal using PQ vectors for distance calculations
   2. reranking the top *k* after finishing graph traversal by recalculating 
distance using the full fidelity vector instead of PQ vectors
   
   The original algorithm (and JVector's implementation) stores the full fidelity vector and the adjacency list for a graph node together on disk. This means that when you retrieve the adjacency list during the first phase, the full fidelity vector is also brought into memory due to locality, so you cache it immediately after reading the adjacency list. When reranking in phase 2, you simply use the vectors cached during the first phase and can rerank without any additional I/O.
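
   To make the two phases concrete, here is a minimal sketch of that flow. `GraphStore`, `PQCodebook`, and `NodeBlock` are hypothetical stand-ins (not Lucene or JVector APIs), and the traversal is simplified to a greedy loop over a PQ-ordered candidate queue rather than the full bounded search list from the paper:

   ```java
   import java.util.ArrayList;
   import java.util.Comparator;
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Map;
   import java.util.PriorityQueue;
   import java.util.Set;

   // On disk, a node's full fidelity vector sits next to its adjacency list.
   record NodeBlock(float[] fullVector, int[] neighbors) {}

   interface GraphStore { NodeBlock read(int node); }                 // one disk read per node
   interface PQCodebook { float approxDistance(int node, float[] query); }

   final class DiskAnnStyleSearch {

     static float l2(float[] a, float[] b) {
       float sum = 0;
       for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; sum += d * d; }
       return sum;
     }

     List<Integer> search(GraphStore store, PQCodebook pq, float[] query,
                          int entryPoint, int k, int searchListSize) {
       Map<Integer, float[]> cache = new HashMap<>();                 // full vectors seen in phase 1
       Set<Integer> visited = new HashSet<>();
       PriorityQueue<Integer> candidates = new PriorityQueue<>(
           Comparator.comparingDouble((Integer n) -> pq.approxDistance(n, query)));
       candidates.add(entryPoint);
       List<Integer> explored = new ArrayList<>();

       // Phase 1: traversal ordered entirely by approximate PQ distances.
       while (!candidates.isEmpty() && explored.size() < searchListSize) {
         int node = candidates.poll();
         if (!visited.add(node)) continue;
         NodeBlock block = store.read(node);                          // adjacency list + full vector
         cache.put(node, block.fullVector());                         // cached "for free" via locality
         explored.add(node);
         for (int neighbor : block.neighbors()) {
           if (!visited.contains(neighbor)) candidates.add(neighbor);
         }
       }

       // Phase 2: rerank the explored nodes with the cached full fidelity vectors, no extra I/O.
       explored.sort(Comparator.comparingDouble((Integer n) -> l2(cache.get(n), query)));
       return explored.subList(0, Math.min(k, explored.size()));
     }
   }
   ```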
   
   There are two interesting nuances to this algorithm:
   - the working set for a given workload is quite large, since it must contain the adjacency lists, PQ vectors, _and full fidelity vectors_ of all nodes that are visited during graph traversal
   - each I/O depends on the previous one, since the I/Os are tied to sequential graph traversal
   
   If you instead do not store the vectors in the graph and do not cache them during traversal, the two points above change, which gives some intuition into the performance of the pinned graph/PQ vectors with `io_uring` (see the sketch after this list):
   - the working set for the first phase is relatively small: just the adjacency lists and PQ vectors of the nodes that are visited
   - there are no ordering dependencies between the I/Os for the second phase, so they can be done in parallel
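
   Below is a hedged sketch of what phase 2 looks like when the vectors live outside the graph: each candidate's full vector is fetched with an independent positional read, so all reads can be in flight at once. A thread pool stands in for `io_uring` here purely to show the lack of ordering dependencies; the actual parallel I/O used native `io_uring` submission, and (as noted later) dispatching `mmap` accesses across a thread pool did not scale well under concurrent queries. The class and file layout are illustrative assumptions:

   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.nio.ByteBuffer;
   import java.nio.channels.FileChannel;
   import java.util.Comparator;
   import java.util.List;
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;

   record Scored(int node, float exactDistance) {}

   final class ParallelRerank {
     private final FileChannel vectorFile;   // flat file of full fidelity vectors, stored outside the graph
     private final int dim;
     private final ExecutorService ioPool = Executors.newFixedThreadPool(32);

     ParallelRerank(FileChannel vectorFile, int dim) {
       this.vectorFile = vectorFile;
       this.dim = dim;
     }

     // Rerank the phase-1 candidates; every read is independent, so they can all
     // be issued concurrently (with io_uring they would be submitted as a batch).
     List<Scored> rerank(List<Integer> candidates, float[] query, int k) {
       int vecBytes = dim * Float.BYTES;
       List<CompletableFuture<Scored>> pending = candidates.stream()
           .map(node -> CompletableFuture.supplyAsync(() -> {
             try {
               ByteBuffer buf = ByteBuffer.allocate(vecBytes);
               vectorFile.read(buf, (long) node * vecBytes);   // positional read, safe across threads
               buf.flip();
               float dist = 0;
               for (int i = 0; i < dim; i++) {
                 float d = buf.getFloat() - query[i];
                 dist += d * d;                                // exact L2 on the full vector
               }
               return new Scored(node, dist);
             } catch (IOException e) {
               throw new UncheckedIOException(e);
             }
           }, ioPool))
           .toList();
       return pending.stream()
           .map(CompletableFuture::join)
           .sorted(Comparator.comparingDouble(Scored::exactDistance))
           .limit(k)
           .toList();
     }
   }
   ```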
   
   ### Results
   The following results are from indexes built against the [Cohere/wiki-22-12-en-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings) data set (35 million vectors, 768 dimensions) using L2 distance. The indexes are all around ~100 GB. The performance tests were run on the same setup as the prior results (`c7gd.16xlarge`, using Docker to artificially constrain available memory), and all ran 8 query threads (each query thread executing its given query in a single thread). Recall was calculated using 10k test vectors that were excluded from the data set used to build the index, and the performance tests were run using 100k vectors that were included in the data set used to build the index (a large number of test vectors was required to prevent a small working set from being established). Pinning parts of the Lucene Vamana index in memory was done using `mlock`.
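
   For reference, here is a minimal sketch of how a memory-mapped file can be pinned with `mlock` from Java, assuming JDK 22+ `java.lang.foreign` and a Linux libc; the class name and command-line argument are illustrative, and this is not the code used for these benchmarks:

   ```java
   import java.lang.foreign.Arena;
   import java.lang.foreign.FunctionDescriptor;
   import java.lang.foreign.Linker;
   import java.lang.foreign.MemorySegment;
   import java.lang.foreign.ValueLayout;
   import java.lang.invoke.MethodHandle;
   import java.nio.channels.FileChannel;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;

   public final class PinIndexFile {
     public static void main(String[] args) throws Throwable {
       Linker linker = Linker.nativeLinker();
       // int mlock(const void *addr, size_t len)
       MethodHandle mlock = linker.downcallHandle(
           linker.defaultLookup().find("mlock").orElseThrow(),
           FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));

       Path graphFile = Path.of(args[0]);     // e.g. the graph or PQ vectors file
       try (Arena arena = Arena.ofConfined();
            FileChannel channel = FileChannel.open(graphFile, StandardOpenOption.READ)) {
         MemorySegment mapping =
             channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
         int rc = (int) mlock.invokeExact(mapping, mapping.byteSize());
         if (rc != 0) {
           throw new IllegalStateException("mlock failed; check RLIMIT_MEMLOCK");
         }
         // ... run queries while the mapping is guaranteed to stay resident ...
       }                                      // closing the arena unmaps (and implicitly unlocks)
     }
   }
   ```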
   
   Some high level information about the indexes:
   
   | Type | Parameters | Recall |
   |-|-|-|
   | Lucene HNSW | maxConn: 16, beamWidth: 100 | 0.78 |
   | Lucene Vamana In-Graph Vectors | maxConn: 32, beamWidth: 100, alpha: 1.2, pqFactor: 8 | 0.88 |
   | Lucene Vamana Out-of-Graph Vectors | maxConn: 32, beamWidth: 100, alpha: 1.2, pqFactor: 8 | 0.88 |
   | JVector | maxConn: 16, beamWidth: 100, alpha: 1.2, pqFactor: 8 | 0.88 |
   
   NB: in Lucene HNSW, `maxConn` configures all levels except level 0, which uses `2*maxConn`; in JVector, the single-layer graph uses `2*maxConn`; and in Lucene Vamana, the single-layer graph uses `maxConn`. Put differently, the bottom layers of all the graphs use the same effective `maxConn` value (32) even though they appear to be configured differently.
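
   As a concrete (hypothetical) illustration using the Lucene 9.9 HNSW codec, the `maxConn: 16` configuration above translates to 32 connections on the bottom layer:

   ```java
   import org.apache.lucene.codecs.KnnVectorsFormat;
   import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;

   public final class HnswDegreeExample {
     public static void main(String[] args) {
       // maxConn = 16 applies to levels above 0; level 0 uses 2 * maxConn = 32,
       // matching the effective bottom-layer degree of the Vamana/JVector graphs.
       KnnVectorsFormat format = new Lucene99HnswVectorsFormat(16, 100);
       System.out.println(format);
     }
   }
   ```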
   
   #### Fully in Memory
   
   | Type | Configuration | Average Latency | QPS | 
   |-|-|-|-|
   | Lucene HNSW | N/A | 1.59 ms | 5030 |
   | Lucene Vamana | vectors: in-graph, rerank: cached | 1.36 ms | 5882 |
   | JVector | N/A | 2.35 ms | 3404 |
   | Lucene Vamana | vectors: out-of-graph, rerank: sequential | 1.31 ms | 6107 |
   | Lucene Vamana | vectors: out-of-graph, rerank: io-uring | 1.65 ms | 4848 |
   
   Charted QPS (left-to-right in the graphs matches top-to-bottom in the above 
table)
   
![lucene-diskann-fully-in-memory](https://github.com/apache/lucene/assets/3181256/3ff46869-e439-45de-9c1b-e9c80f5f7a5e)
   
   #### 16GB System Memory
   
   | Type | Configuration | Average Latency | QPS |
   |-|-|-|-|
   | Lucene HNSW | N/A | 205 ms | 39 |
   | Lucene Vamana | vectors: in-graph, rerank: cached | 13.4 ms | 597 |
   | JVector | N/A | 15.9 ms | 503 | 
   | Lucene Vamana | vectors: out-of-graph, rerank: sequential | 13.7 ms | 583 |
   | Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: false | 4.56 ms | 1754 |
   | Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: true | 2.22 ms | 3603 |
   
   Charted QPS (left-to-right in the graphs matches top-to-bottom in the above 
table, Lucene HNSW is omitted from the graphs)
   
![lucene-diskann-16gb](https://github.com/apache/lucene/assets/3181256/e7c4e593-c0a3-408a-aef3-303f416cc37d)
   
   #### 8GB System Memory
   
   | Type | Configuration | Average Latency | QPS |
   |-|-|-|-|
   | Lucene HNSW | N/A | 280 ms | 28 |
   | Lucene Vamana | vectors: in-graph, rerank: cached | 17.3 ms | 462 |
   | JVector | N/A | 19.3 ms | 414 | 
   | Lucene Vamana | vectors: out-of-graph, rerank: sequential | 24.1 ms | 331 |
   | Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: false | 14.6 ms | 547 |
   | Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: true | 2.47 ms | 3238 |
   
   Charted QPS (left-to-right in the graphs matches top-to-bottom in the above 
table, Lucene HNSW is omitted from the graphs)
   
![lucene-diskann-8gb](https://github.com/apache/lucene/assets/3181256/2a84dd39-c68d-4fab-abed-17ee9ef770a4)
   
   ### Result Analysis
   High level takeaways:
   - Lucene HNSW performance suffers greatly when falling out of memory
   - DiskANN doesn't seem to lose any of the performance of HNSW when fully in 
memory, and may actually be faster
   - the original DiskANN algorithm provides improved performance and is not 
overly sensitive to the page cache's behavior
   - the modified DiskANN algorithm (not storing vectors in the graph) is more 
sensitive to the page cache
        - when doing sequential phase 2 I/O, performance was better than the 
original DiskANN algorithm with more memory (16GB) available, but worse when 
there was less (8GB)
        - when doing parallel phase 2 I/O, performance was always better than the original DiskANN algorithm, but the relative improvement and stability of performance were very sensitive to the graph and PQ vectors being in memory; pinning the graph and PQ vectors in memory provided excellent and consistent performance
   - using explicit I/O (in this case `io_uring`) is expensive when the page is 
already in the page cache
        - using `io_uring` while fully in memory is about 20% slower than 
accessing the vectors through `mmap`
   
   ### Other Notes
   - when running only a single query at a time, I was able to get similar performance by implementing parallel I/O with `mmap` and dispatching the accesses across a thread pool, but this scaled very poorly with concurrent queries
   - I tried implementing concurrent graph traversal as the DiskANN paper 
suggests (looking up the adjacency list of the top *L* candidates 
concurrently), but was unable to get any improvements
   - parallel I/O could be promising for SPANN, since there are similarly no 
ordering dependencies when retrieving vectors from the postings lists
   - all implementations *besides* the modified DiskANN with a pinned index and parallel rerank I/O are sensitive to the configured readahead. The numbers reported above use `--setra 8`; when using `--setra 128` instead, performance looks like this (for 8GB system memory):
   
   | Type | Configuration | Average Latency | QPS |
   |-|-|-|-|
   | Lucene Vamana | vectors: in-graph, rerank: cached | 51.3 ms | 156 |
   | Lucene Vamana | vectors: out-of-graph, rerank: sequential | 46.9 ms | 170 |
   | Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: false | 37.9 ms | 211 |
   | Lucene Vamana | vectors: out-of-graph, rerank: io-uring, mlocked: true | 2.49 ms | 3212 |
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

