RKSPD opened a new pull request, #14892: URL: https://github.com/apache/lucene/pull/14892
## Motivation

Lucene’s built‑in HNSW KnnVectorsFormat delivers strong recall/latency, but its index must reside entirely in RAM. As demand grows for vector datasets of larger dimensionality and greater index size, the cost of scaling systems like HNSW becomes prohibitively expensive. JVector is a pure‑Java ANN engine that ultimately aims to merge DiskANN’s disk‑resident search with HNSW’s navigable‑small‑world graph. Today the library still loads the whole graph in RAM (like plain HNSW), but its public roadmap is moving toward split‑layer storage where only the upper graph levels live in memory and deeper layers plus raw vectors remain on disk.

OpenSearch has successfully integrated JVector through the OpenSearch-JVector repository, but that implementation contains several OpenSearch-specific dependencies. As OpenSearch continues to develop new features and optimizations for its codec, this implementation allows those features to be continually developed and tested in Lucene itself. With this PR I will also include a link to a luceneutil-jvector repository that works with the proposed JVector codec without significant modifications.

## Dependency Information

* **`io.github.jbellis:jvector:4.0.0-beta.6`** – the ANN engine (automatic module `jvector`)
* **`org.agrona:agrona:1.20.0`** – off-heap buffer utilities
* **`org.apache.commons:commons-math3:3.6.1`** – PQ math helpers
* **`org.yaml:snakeyaml:2.4`** – only needed if you load YAML tuning files
* **`org.slf4j:slf4j-api:2.0.17`** – logging façade (overrides JVector’s 2.0.16 to match the rest of Lucene)
* *All jars have matching LICENSE/NOTICE entries added under `lucene/licenses/`*

## Vector Codec – design highlights

*Per-segment, per-field indexes*
Each Lucene segment owns its own JVector graph index. The graph payloads live in a single `*.data-jvector` file and the per-field metadata lives in a companion `*.meta-jvector` file, mirroring Lucene’s existing `*.vec`/`*.vex` layout.

*Bulk build at flush time*
Vectors are streamed into the ordinary flat-vector writer while an in-memory OnHeapGraphIndex is built. When the segment flushes, the whole graph (and optional Product Quantization codebooks) is handed to OnDiskSequentialGraphIndexWriter and serialized to disk in one pass.

*Single data file, concatenated fields*
All field-specific graphs (and PQ blobs) are appended one after another inside `*.data-jvector`; their start offsets, lengths, and build parameters are recorded in `*.meta-jvector` so the reader can jump straight to the right slice.

*Zero-copy loading on open*
JVectorReader memory-maps the data file and spawns a lightweight OnDiskGraphIndex for each field via ReaderSupplier. No temp files are created; the mmap’d bytes are shared across threads and searches.

*Pure-Java search path*
At query time the float vector is passed directly to GraphSearcher (DiskANN-style). Results are optionally re-ranked with an exact scorer, then surfaced through a thin JVectorKnnCollector wrapper so the rest of Lucene sees a normal TopDocs.

*Ordinal → doc-ID mapping still in Lucene*
JVector returns internal ordinals; we convert them to docIDs using Lucene’s existing ordinal map during collection.
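For readers who want to see where such a format plugs in, below is a minimal, hypothetical sketch of indexing and searching a vector field through a per-field `KnnVectorsFormat`, using only standard Lucene APIs. The `JVectorFormat` class name and its no-arg constructor are placeholders for whatever this PR ultimately exposes, and the default codec class used here (`Lucene101Codec`) depends on which Lucene version you build against.

```java
import java.nio.file.Path;

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class JVectorFormatSketch {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig();
    // Route the "vec" field to the JVector-backed format; all other fields keep the default.
    // JVectorFormat is a placeholder for the KnnVectorsFormat proposed in this PR.
    iwc.setCodec(new Lucene101Codec() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        return "vec".equals(field) ? new JVectorFormat() : super.getKnnVectorsFormatForField(field);
      }
    });

    try (Directory dir = FSDirectory.open(Path.of("/tmp/jvector-demo"));
         IndexWriter writer = new IndexWriter(dir, iwc)) {
      Document doc = new Document();
      doc.add(new KnnFloatVectorField("vec", new float[] {0.1f, 0.2f, 0.3f},
          VectorSimilarityFunction.EUCLIDEAN));
      writer.addDocument(doc);
      writer.commit();

      // Searching uses the ordinary KNN query API; the format's reader performs the graph
      // traversal and the ordinal -> docID mapping described above, so callers just see TopDocs.
      try (DirectoryReader reader = DirectoryReader.open(writer)) {
        TopDocs hits = new IndexSearcher(reader)
            .search(new KnnFloatVectorQuery("vec", new float[] {0.1f, 0.2f, 0.3f}, 10), 10);
        System.out.println("hits: " + hits.totalHits);
      }
    }
  }
}
```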
## Initial Benchmark Results

### Small Corpus Testing (Wikipedia Cohere 768, 200k docs)

```
Results: Lucene

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) overSample vec_disk(MB) vec_RAM(MB) indexType
0.803 2.800 2.343 0.837 200000 100 300 12 16 7 bits 8.46 23646.25 7.58 1 736.49 1.000 733.185 147.247 HNSW
0.822 2.486 2.286 0.920 200000 100 300 12 20 7 bits 7.33 27273.97 7.45 1 736.76 1.000 733.185 147.247 HNSW
0.857 2.657 2.429 0.914 200000 100 300 12 28 7 bits 13.64 14658.46 8.97 1 737.15 1.000 733.185 147.247 HNSW
0.831 2.771 2.514 0.907 200000 100 300 16 16 7 bits 6.44 31075.20 7.42 1 736.61 1.000 733.185 147.247 HNSW
0.846 2.857 2.571 0.900 200000 100 300 16 20 7 bits 7.19 27812.54 8.42 1 736.86 1.000 733.185 147.247 HNSW
0.869 3.029 2.657 0.877 200000 100 300 16 28 7 bits 8.47 23626.70 10.04 1 737.17 1.000 733.185 147.247 HNSW
0.847 2.829 2.486 0.879 200000 100 300 20 16 7 bits 6.11 32717.16 7.05 1 736.68 1.000 733.185 147.247 HNSW
0.862 2.743 2.429 0.885 200000 100 300 20 20 7 bits 6.92 28893.38 8.13 1 736.88 1.000 733.185 147.247 HNSW
0.883 3.086 2.743 0.889 200000 100 300 20 28 7 bits 7.94 25176.23 8.90 1 737.26 1.000 733.185 147.247 HNSW
0.860 2.943 2.657 0.903 200000 100 300 24 16 7 bits 9.37 21342.44 7.21 1 736.69 1.000 733.185 147.247 HNSW
0.880 3.371 3.143 0.932 200000 100 300 24 20 7 bits 7.77 25749.97 8.38 1 736.92 1.000 733.185 147.247 HNSW
0.900 3.086 2.886 0.935 200000 100 300 24 28 7 bits 8.37 23900.57 9.70 1 737.29 1.000 733.185 147.247 HNSW
```

```
Results: JVector

recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized index(s) index_docs/s force_merge(s) num_segments index_size(MB) vec_disk(MB) vec_RAM(MB) indexType
0.877 3.943 3.714 0.942 200000 100 300 12 16 7 bits 12.94 15458.34 101.28 1 1197.28 733.185 147.247 HNSW
0.901 3.771 3.629 0.962 200000 100 300 12 20 7 bits 13.89 14394.70 123.37 1 1197.28 733.185 147.247 HNSW
0.913 3.457 3.314 0.959 200000 100 300 12 28 7 bits 18.52 10802.05 136.17 1 1197.28 733.185 147.247 HNSW
0.915 3.743 3.571 0.954 200000 100 300 16 16 7 bits 15.16 13193.48 118.83 1 1200.28 733.185 147.247 HNSW
0.921 4.029 3.857 0.957 200000 100 300 16 20 7 bits 18.83 10620.22 134.91 1 1200.28 733.185 147.247 HNSW
0.931 3.886 3.714 0.956 200000 100 300 16 28 7 bits 22.87 8746.61 174.35 1 1200.28 733.185 147.247 HNSW
0.921 5.400 5.257 0.974 200000 100 300 20 16 7 bits 15.68 12758.36 126.82 1 1203.30 733.185 147.247 HNSW
0.929 4.229 4.057 0.959 200000 100 300 20 20 7 bits 19.68 10161.57 152.86 1 1203.30 733.185 147.247 HNSW
0.942 4.343 4.171 0.961 200000 100 300 20 28 7 bits 27.79 7197.35 212.50 1 1203.30 733.185 147.247 HNSW
0.930 4.257 4.086 0.960 200000 100 300 24 16 7 bits 17.47 11449.51 131.11 1 1206.33 733.185 147.247 HNSW
0.943 4.314 4.143 0.960 200000 100 300 24 20 7 bits 21.34 9371.63 162.54 1 1206.33 733.185 147.247 HNSW
0.940 4.914 4.743 0.965 200000 100 300 24 28 7 bits 29.75 6722.24 235.78 1 1206.33 733.185 147.247 HNSW
```

# Testing JVectorCodec Using luceneutil-jvector

This guide provides step-by-step instructions for benchmarking and testing JVectorCodec performance using the luceneutil-jvector testing framework.

## Prerequisites

* Java development environment with Gradle support
* Python 3.x installed
* Git installed
* SSD storage recommended for optimal performance

## Setup Instructions
### 1. Environment Preparation

Create a benchmark directory on an SSD for optimal I/O performance:

```
mkdir LUCENE_BENCH_HOME
cd LUCENE_BENCH_HOME
```

### 2. Repository Cloning

Clone the required repositories:

```
git clone https://github.com/RKSPD/lucene-jvector lucene_candidate
git clone https://github.com/RKSPD/luceneutil-jvector util
```

**Note:** The `lucene-jvector` repository contains the same code as the PR under review.

### 3. Initial Setup and Data Download

Navigate to the utilities directory and run the initial setup:

```
cd util
python3 src/python/initial_setup.py -d
```

This command downloads the necessary test datasets. The download may take some time depending on your internet connection.

### 4. Lucene Build

While the data is downloading, open a new terminal session and build Lucene:

```
cd LUCENE_BENCH_HOME/lucene_candidate
./gradlew build
```

## Running Performance Tests

### 5. Initial Test Run

Once both the build and download processes are complete, navigate back to the utilities directory:

```
cd LUCENE_BENCH_HOME/util
```

Run the KNN performance test:

```
./gradlew runKnnPerfTest
```

**Important:** The first execution is expected to fail. This initial run generates the path definitions for your Lucene repository and determines the Lucene version.

### 6. Successful Test Execution

Run the performance test a second time:

```
./gradlew runKnnPerfTest
```

This execution should complete successfully and report performance metrics.

## Configuration and Tuning

### 7. Parameter Customization

To customize the testing parameters for your specific benchmarking needs:

#### Merge Policy Configuration

* **File:** `util/src/main/knn/KnnIndexer.java`
* **Purpose:** Configure the merge policy for index optimization

#### Codec Configuration

* **File:** `util/src/main/knn/KnnGraphTester.java`
* **Method:** `getCodec()`
* **Purpose:** Specify which codec implementation to test (see the sketch at the end of this guide)

#### Performance Test Parameters

* **File:** `src/python/knnPerfTest.py`
* **Section:** `params` block
* **Purpose:** Adjust various performance testing parameters, including:
  * Vector dimensions
  * Index size
  * Query parameters
  * Recall targets
  * Other algorithm-specific settings

## Expected Outcomes

Upon successful completion, you will have:

* A fully configured benchmarking environment
* Performance metrics comparing JVectorCodec against baseline implementations
* Configurable parameters for comprehensive testing scenarios

## Troubleshooting

* Ensure sufficient disk space for dataset downloads and index generation
* Verify that the Java and Python environments are properly configured
* Check network connectivity if the initial setup fails during the download phase
* Confirm SSD usage for optimal I/O performance during benchmarking

## Long-Term Considerations

**Split-layer storage roadmap**

* JVector aims to keep only the upper graph levels in RAM while deeper layers and raw vectors live on disk. Plan for API changes and configuration knobs as this feature stabilizes.

**Backwards compatibility with previous JVector implementations**

* As the codec evolves, there is no guarantee that indexes built with earlier JVectorCodec implementations will remain readable by newer versions of JVector.
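To make the codec-configuration step above concrete, here is a rough, hypothetical sketch of the kind of switch the `getCodec()` hook in `KnnGraphTester.java` performs when choosing between the baseline HNSW format and the JVector format. The `JVectorFormat` name and constructor, the method signature, and the default codec class (`Lucene101Codec`) are placeholders; consult the luceneutil-jvector fork for the real parameters.

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;

public class CodecSelectionSketch {
  /**
   * Illustrative stand-in for KnnGraphTester#getCodec(): pick the KnnVectorsFormat
   * that every vector field in the benchmark index will use.
   */
  static Codec getCodec(boolean useJVector, int maxConn, int beamWidth) {
    KnnVectorsFormat format = useJVector
        ? new JVectorFormat()                                // placeholder for the PR's format
        : new Lucene99HnswVectorsFormat(maxConn, beamWidth); // baseline HNSW format
    return new Lucene101Codec() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        return format;
      }
    };
  }
}
```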