kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1806615864

   I've got my framework set up for testing larger-than-memory indexes and have some interesting first results.
   
   TL;DR:
   - the main thing driving jvector's larger-than-memory performance is keeping the product-quantized vectors in memory, which avoids the need for I/O to do distance calculations (see the sketch after this list). This is reflected in ~24x fewer page faults and a similar improvement in query latency
   - for this dataset, introducing PQ does not negatively affect recall until a factor of 16, after which recall begins to drop off. Recall actually improves slightly with each factor increase up to 4, and does not change materially at 8
   - storing the vectors inline in the graph does not seem to have a large impact on larger-than-memory performance
   - other differences, like keeping a cache of neighbors close to the entry point or storing offsets in memory vs. using consistent offsets, are marginal and don't account for the order-of-magnitude improvement
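
   To make the PQ point concrete, here is a rough sketch of asymmetric distance computation with product quantization, which is roughly what lets the quantized vectors answer distance queries from memory during graph traversal. The class name, subspace count, and codebook layout are illustrative assumptions, not jvector's actual API:

   ```java
   // Illustrative sketch only -- not jvector's implementation or API.
   final class PqDistanceSketch {
     static final int DIMS = 768;       // Cohere embedding dimensionality
     static final int SUBSPACES = 384;  // pq=2: one byte-sized code per 2 original dims
     static final int SUB_DIM = DIMS / SUBSPACES;
     static final int CENTROIDS = 256;  // codes fit in a single byte

     // codebooks[m][c] is the c-th centroid of subspace m (SUB_DIM floats),
     // trained offline and small enough to keep resident in memory.
     final float[][][] codebooks;

     PqDistanceSketch(float[][][] codebooks) {
       this.codebooks = codebooks;
     }

     // Per-query lookup table: squared L2 distance from each query sub-vector
     // to every centroid of its subspace.
     float[][] buildLookupTable(float[] query) {
       float[][] table = new float[SUBSPACES][CENTROIDS];
       for (int m = 0; m < SUBSPACES; m++) {
         for (int c = 0; c < CENTROIDS; c++) {
           float sum = 0f;
           for (int d = 0; d < SUB_DIM; d++) {
             float diff = query[m * SUB_DIM + d] - codebooks[m][c][d];
             sum += diff * diff;
           }
           table[m][c] = sum;
         }
       }
       return table;
     }

     // Approximate distance to an encoded vector is just SUBSPACES array lookups,
     // never touching the full-precision vector on disk -- hence no page faults
     // for distance calculations during traversal.
     float approximateDistance(float[][] table, byte[] codes) {
       float sum = 0f;
       for (int m = 0; m < SUBSPACES; m++) {
         sum += table[m][codes[m] & 0xFF];
       }
       return sum;
     }
   }
   ```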
   
   Next steps:
   - measure lucene performance with scalar quantization
   - incorporate PQ into the lucene vamana implementation
   - explore SPANN to see how it performs relative to these implementations

   ### Setup
   I used the [Cohere/wikipedia-22-12-es-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-es-embeddings) dataset (768 dimensions) with L2 similarity for these experiments. I randomly sampled 10,000 vectors from the dataset as the test set and used the remaining 10,114,929 vectors as the training set.
   
   I built the indexes and ran the query tests on a `c7gd.16xlarge` (128 GB memory, 64 threads; thank you for the addition of concurrent merging, it sped up the builds drastically!). The queries with restricted system memory were run by purging the host page cache, then running in a Docker container with `-m 8g`, a Corretto Java 21 JVM with `-Xms4g` and `-Xmx4g` and the Panama vector API enabled, and with the dataset and indexes bind-mounted into the container. There is a warmup phase of 2 iterations over the 10,000 test vectors, followed by 3 measured iterations over the same 10,000 vectors, with all iterations running 48 queries at a time and a single thread per query.
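
   For reference, here is a minimal sketch of the query harness just described (2 warmup passes, 3 measured passes, 48 queries in flight, one thread per query). `searchIndex` and the timing bookkeeping are placeholders, not the actual benchmark code:

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Future;

   // Minimal sketch of the benchmark loop; searchIndex() stands in for the
   // actual Lucene/jvector k-NN search call.
   class QueryHarnessSketch {
     static void run(float[][] testVectors) throws Exception {
       ExecutorService pool = Executors.newFixedThreadPool(48); // 48 queries at a time
       for (int i = 0; i < 2; i++) runIteration(pool, testVectors, false); // warmup
       for (int i = 0; i < 3; i++) runIteration(pool, testVectors, true);  // measured
       pool.shutdown();
     }

     static void runIteration(ExecutorService pool, float[][] queries, boolean measure)
         throws Exception {
       List<Future<Long>> latencies = new ArrayList<>();
       for (float[] query : queries) {
         latencies.add(pool.submit(() -> {
           long start = System.nanoTime();
           searchIndex(query); // single thread per query
           return System.nanoTime() - start;
         }));
       }
       long totalNanos = 0;
       for (Future<Long> latency : latencies) totalNanos += latency.get();
       if (measure) {
         System.out.printf("average duration: %.5fs%n", totalNanos / 1e9 / queries.length);
       }
     }

     static int[] searchIndex(float[] query) {
       return new int[0]; // placeholder for the real k-NN search
     }
   }
   ```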
   
   Note that the indexes aren't quite apples-to-apples, as the Lucene HNSW 
index configuration (`maxConn: 16`, `beamWidth: 100`) achieves 0.8247 recall 
(and slightly better latency when in memory) whereas both Vamana configurations 
(`M: 32`, `beamWidth: 100`, `alpha: 1.2`) achieve ~0.91 recall, but the broad 
strokes are obvious enough to draw some conclusions.
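
   For context on where those parameters plug in, here is a rough sketch of configuring `maxConn`/`beamWidth` for a Lucene HNSW index, assuming Lucene 9.5-era class names (`Lucene95Codec`, `Lucene95HnswVectorsFormat`); the Vamana/jvector builders take `M`, `beamWidth`, and `alpha` through their own APIs:

   ```java
   import java.nio.file.Path;
   import org.apache.lucene.codecs.KnnVectorsFormat;
   import org.apache.lucene.codecs.lucene95.Lucene95Codec;
   import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.KnnFloatVectorField;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.index.VectorSimilarityFunction;
   import org.apache.lucene.store.FSDirectory;

   class HnswIndexSketch {
     static void buildIndex(Path dir, Iterable<float[]> vectors) throws Exception {
       IndexWriterConfig config = new IndexWriterConfig()
           .setCodec(new Lucene95Codec() {
             @Override
             public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
               return new Lucene95HnswVectorsFormat(16, 100); // maxConn: 16, beamWidth: 100
             }
           });
       try (IndexWriter writer = new IndexWriter(FSDirectory.open(dir), config)) {
         for (float[] vector : vectors) {
           Document doc = new Document();
           // L2 similarity, matching the experiment setup
           doc.add(new KnnFloatVectorField("vector", vector, VectorSimilarityFunction.EUCLIDEAN));
           writer.addDocument(doc);
         }
         writer.forceMerge(1); // search a single graph/segment
       }
     }
   }
   ```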
   
   #### Index size
   | configuration | size | breakdown |
   |-|-|-|
   | lucene hnsw | 31.42 GB  | 31.07 GB vectors + 0.35 GB graph |
   | lucene vamana | 31.73 GB | 31.73 GB graph with vectors |
   | jvector pq=0 | 32.45 GB | 32.45 GB graph with vectors |
   | jvector pq=2 | 36.33 GB | 32.45 GB graph with vectors + 3.88 GB quantized vectors |
   | jvector pq=4 | 34.39 GB | 32.45 GB graph with vectors + 1.94 GB quantized vectors |
   | jvector pq=8 | 33.42 GB | 32.45 GB graph with vectors + 0.97 GB quantized vectors |
   | jvector pq=16 | 32.93 GB | 32.45 GB graph with vectors + 0.48 GB quantized vectors |
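
   The quantized-vector sizes line up with each vector being reduced to `768 / pq` one-byte PQ codes (my reading of the `pq=N` parameter from the numbers above, not a documented definition); for example, for `pq=2`:

   ```latex
   % pq=2 on 768-dim vectors: one byte per 2-dimensional subspace
   10{,}114{,}929\ \text{vectors} \times \tfrac{768}{2}\ \text{bytes} \approx 3.88\ \text{GB}
   ```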
   
   #### Queries fully in memory (124 GB of RAM available)
   |configuration | recall | average duration |
   |-|-|-|
   | lucene hnsw | 0.8247 | 0.00187s |
   | lucene vamana | 0.9086 | 0.00257s |
   | jvector pq=0 | 0.9108 | 0.00495s |
   | jvector pq=2 | 0.9109 | 0.00259s |
   | jvector pq=4 | 0.9122 | 0.00291s |
   | jvector pq=8 | 0.9121 | 0.00148s |
   | jvector pq=16 | 0.8467 | 0.0012s |
   
   #### Queries with 8 GB system memory, 4 GB heap
   |configuration | recall | average duration | average page faults per query | total page faults|
   |-|-|-|-|-|
   | lucene hnsw | 0.8247 | 2.950s | 1464.65 | 43939500 |
   | lucene vamana | 0.9086 | 3.122s | 1662.26 | 49867932 |
   | jvector pq=0 | 0.9108 | 3.651s | 1716.03 | 51481117 |
   | jvector pq=2 | 0.9109 | 0.127s | 69.65 | 2089449 |
   | jvector pq=4 | 0.9122 | 0.131s | 68.94 | 2068274 |
   | jvector pq=8 | 0.9121 | 0.119s | 70.35 | 2110418 |
   | jvector pq=16 | 0.8467 | 0.126s | 72.64 | 2179330 |
   
   #### Conclusions
   |conclusion|reasoning|
   |-|-|
   | storing the vectors inline in the graph does not seem to have a large impact on larger-than-memory performance | lucene hnsw and lucene vamana performance is in the same order of magnitude, with similar numbers of page faults |
   | jvector's neighbor cache and consistent offsets are not large determinants of larger-than-memory performance | lucene vamana (which has no cache and keeps offsets in memory) and jvector vamana `pq=0` performance is in the same order of magnitude |
   | pq is the largest determinant in improving performance | jvector `pq=2` latency is ~28x better than jvector `pq=0`, and all `pq=(n != 0)` configurations perform similarly |
   | introducing pq does not immediately impact recall on this dataset | recall actually improves when introducing pq, and only starts to decrease at a factor of 16 |
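
   For reference, the headline ratios quoted above fall straight out of the 8 GB table (jvector `pq=0` vs. `pq=2`):

   ```latex
   \frac{3.651\,\text{s}}{0.127\,\text{s}} \approx 28.7\times \ \text{(latency)}
   \qquad
   \frac{51{,}481{,}117}{2{,}089{,}449} \approx 24.6\times \ \text{(page faults)}
   ```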
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
