kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1806615864
I've got my framework set up for testing larger-than-memory indexes and have some somewhat interesting first results. TL;DR:

- The main thing driving jvector's larger-than-memory performance is keeping product-quantized vectors in memory, which avoids the need for I/O to do distance calculations. This is reflected in ~24x fewer page faults, and a similar improvement in query latency.
- For this dataset, introducing PQ does not have a negative effect on recall until reaching a factor of 16, after which recall begins to drop off. Recall actually improves slightly with each factor increase up to 4, and does not materially change at 8.
- Storing the vectors inline in the graph does not seem to have a large impact on larger-than-memory performance.
- Other differences, like having a cache of neighbors close to the entry point or storing offsets in memory vs. having consistent offsets, are marginal and don't account for the order-of-magnitude improvement.

Next steps:

- Measure Lucene performance with scalar quantization.
- Incorporate PQ into the Lucene Vamana implementation.
- Explore SPANN to see how it performs relative to these implementations.

### Setup

I used the [Cohere/wikipedia-22-12-es-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-es-embeddings) dataset (768 dimensions) with L2 similarity. I split out 10,000 vectors randomly sampled from the dataset as the test set, leaving the remaining 10,114,929 vectors as the training set.

I built the indexes and ran the query tests on a `c7gd.16xlarge` (128 GB memory, 64 threads; thank you for the addition of concurrent merging, it helped speed up the builds drastically!). The queries with restricted system memory were run by purging the host page cache, then running in a Docker container with `-m 8g`, a Corretto Java 21 JVM with `-Xms4g` and `-Xmx4g` and the Panama vector API enabled, and with the dataset and indexes bind-mounted into the container. There is a warmup phase of 2 iterations over the 10,000 test vectors, followed by 3 measured iterations over the same vectors, with all iterations running 48 queries at a time and a single thread per query (a rough sketch of this loop is below).

Note that the indexes aren't quite apples-to-apples: the Lucene HNSW configuration (`maxConn: 16`, `beamWidth: 100`) achieves 0.8247 recall (and slightly better latency when fully in memory), whereas both Vamana configurations (`M: 32`, `beamWidth: 100`, `alpha: 1.2`) achieve ~0.91 recall. Still, the broad strokes are obvious enough to draw some conclusions.
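For context on how the latency numbers below were gathered, here is a minimal sketch of that measurement loop in Java. It is illustrative only, not the actual harness: `VectorSearcher`, its `search` method, and the choice of `k = 10` are hypothetical stand-ins.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for whichever index (HNSW, Vamana, jvector) is under test.
interface VectorSearcher {
  int[] search(float[] query, int k);
}

class QueryBenchmark {
  // One iteration: all 10,000 test queries, 48 in flight at a time,
  // one thread per query. Returns the mean per-query latency in seconds.
  static double avgSecondsPerQuery(VectorSearcher searcher, List<float[]> queries)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(48);
    LongAdder totalNanos = new LongAdder();
    List<Future<?>> inFlight = queries.stream()
        .map(q -> pool.submit(() -> {
          long start = System.nanoTime();
          searcher.search(q, 10); // k = 10 assumed, not stated in the writeup
          totalNanos.add(System.nanoTime() - start);
        }))
        .toList();
    for (Future<?> f : inFlight) f.get(); // wait for the full iteration
    pool.shutdown();
    return totalNanos.sum() / 1e9 / queries.size();
  }

  // 2 warmup iterations, then the average over 3 measured iterations.
  static void run(VectorSearcher searcher, List<float[]> queries) throws Exception {
    for (int i = 0; i < 2; i++) avgSecondsPerQuery(searcher, queries);
    double avg = 0;
    for (int i = 0; i < 3; i++) avg += avgSecondsPerQuery(searcher, queries) / 3;
    System.out.printf("average duration: %.5fs%n", avg);
  }
}
```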
#### Index size

| configuration | size | breakdown |
|-|-|-|
| lucene hnsw | 31.42 GB | 31.07 GB vectors + 0.35 GB graph |
| lucene vamana | 31.73 GB | 31.73 GB graph with vectors |
| jvector pq=0 | 32.45 GB | 32.45 GB graph with vectors |
| jvector pq=2 | 36.33 GB | 32.45 GB graph with vectors + 3.88 GB quantized vectors |
| jvector pq=4 | 34.39 GB | 32.45 GB graph with vectors + 1.94 GB quantized vectors |
| jvector pq=8 | 33.42 GB | 32.45 GB graph with vectors + 0.97 GB quantized vectors |
| jvector pq=16 | 32.93 GB | 32.45 GB graph with vectors + 0.48 GB quantized vectors |

#### Queries fully in memory (124 GB of RAM available)

| configuration | recall | average duration |
|-|-|-|
| lucene hnsw | 0.8247 | 0.00187s |
| lucene vamana | 0.9086 | 0.00257s |
| jvector pq=0 | 0.9108 | 0.00495s |
| jvector pq=2 | 0.9109 | 0.00259s |
| jvector pq=4 | 0.9122 | 0.00291s |
| jvector pq=8 | 0.9121 | 0.00148s |
| jvector pq=16 | 0.8467 | 0.0012s |

#### Queries with 8 GB system memory, 4 GB heap

| configuration | recall | average duration | average page faults (per query) | total page faults |
|-|-|-|-|-|
| lucene hnsw | 0.8247 | 2.950s | 1,464.65 | 43,939,500 |
| lucene vamana | 0.9086 | 3.122s | 1,662.26 | 49,867,932 |
| jvector pq=0 | 0.9108 | 3.651s | 1,716.03 | 51,481,117 |
| jvector pq=2 | 0.9109 | 0.127s | 69.65 | 2,089,449 |
| jvector pq=4 | 0.9122 | 0.131s | 68.94 | 2,068,274 |
| jvector pq=8 | 0.9121 | 0.119s | 70.35 | 2,110,418 |
| jvector pq=16 | 0.8467 | 0.126s | 72.64 | 2,179,330 |

#### Conclusions

| conclusion | reasoning |
|-|-|
| storing the vectors inline in the graph does not seem to have a large impact on larger-than-memory performance | lucene hnsw and lucene vamana performance is in the same order of magnitude, with similar numbers of page faults |
| jvector's neighbor cache and consistent offsets are not large determinants in improving larger-than-memory performance | lucene vamana (which has no cache and in-memory offsets) and jvector vamana `pq=0` performance is in the same order of magnitude |
| pq is the largest determinant in improving performance | jvector `pq=2` performance is ~28x better than jvector `pq=0` performance, and all `pq=(n != 0)` performance is similar (see the sketch below) |
| introducing pq does not immediately impact recall on this dataset | recall actually improves when introducing pq, and only starts to decrease at a factor of 16 |
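To make the PQ conclusion concrete: once the codebooks and per-vector codes are resident in memory, scoring a candidate during graph traversal is pure array lookups and never faults on the full-precision vectors. Judging by the quantized-vector sizes in the index table above, a factor of `f` corresponds to `768/f` one-byte codes per vector (e.g. `pq=4` → 192 bytes/vector ≈ 1.94 GB for ~10.1M vectors). Below is a minimal sketch of generic asymmetric distance computation (ADC) over PQ codes; it is not jvector's actual implementation, and all names are hypothetical.

```java
// Generic PQ asymmetric distance computation (ADC): distances are computed
// between a full-precision query and quantized candidates, entirely in memory.
class PqDistance {
  final int subspaces;         // e.g. 192 for 768 dims at pq=4
  final int subDim;            // dimensions per subspace
  final float[][][] codebooks; // [subspace][256 centroids][subDim], held in memory

  PqDistance(float[][][] codebooks) {
    this.codebooks = codebooks;
    this.subspaces = codebooks.length;
    this.subDim = codebooks[0][0].length;
  }

  // Once per query: squared L2 distance from each query sub-vector to each
  // of the 256 centroids in that subspace.
  float[][] lookupTable(float[] query) {
    float[][] table = new float[subspaces][256];
    for (int s = 0; s < subspaces; s++) {
      for (int c = 0; c < 256; c++) {
        float sum = 0;
        for (int d = 0; d < subDim; d++) {
          float diff = query[s * subDim + d] - codebooks[s][c][d];
          sum += diff * diff;
        }
        table[s][c] = sum;
      }
    }
    return table;
  }

  // Per candidate: one table lookup per subspace over its in-memory code
  // bytes -- no I/O against the full-precision vectors on disk.
  float distance(float[][] table, byte[] codes) {
    float sum = 0;
    for (int s = 0; s < subspaces; s++) {
      sum += table[s][codes[s] & 0xFF];
    }
    return sum;
  }
}
```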