abernardi597 opened a new pull request, #15472: URL: https://github.com/apache/lucene/pull/15472
### Description

I took a stab at bringing the [OpenSearch JVector codec](https://github.com/opensearch-project/opensearch-jvector) into Lucene as a codec in `sandbox` (see issue #14681) to see how a DiskANN-inspired index might compare to the current generation of HNSW. I made quite a few changes along the way and wanted to cut this PR to share some of those changes and results, and maybe solicit some feedback from interested parties. Most notably, I removed the incremental graph building functionality that is used to speed up merges, though I'd like to add it back and look at the improvements in merge time for JVector indices.

This PR is not really intended to be merged, in light of some of the feedback on the previous PR (#14892) that suggests Lucene should try to incorporate some of the learnings rather than add yet another KNN engine.

I hooked it up to `lucene-util` (PR incoming) for comparison, trying to play into the strengths of each codec while also maintaining similar levels of parallelism. I ran HNSW using 32x indexing threads and force-merging into 1 segment, while JVector used 1x indexing thread backed by a 32x concurrency `ForkJoinPool` for its SIMD operations and `ForkJoinPool.commonPool()` for its other parallel operations. I also fixed `oversample=1` for both and used `neighborOverflow=2` and `alpha=2` for JVector.

These results are from the 768-dim cohere dataset.

| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | visited | index(s) | index_docs/s | force_merge(s) | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType | metric |
|--------|-------------|--------|-------------|--------|------|--------|---------|-----------|-----------|---------|----------|--------------|----------------|--------------|----------------|--------------|-------------|-----------|--------|
| 0.965 | 1.408 | 1.399 | 0.994 | 100000 | 100 | 50 | 64 | 250 | no | 4968 | 5.99 | 16700.07 | 10.10 | 1 | 298.17 | 292.969 | 292.969 | HNSW | COSINE |
| 0.939 | 2.186 | 2.155 | 0.986 | 100000 | 100 | 50 | 64 | 250 | no | 3485 | 19.58 | 5107.77 | 0.01 | 1 | 318.80 | 292.969 | 292.969 | JVECTOR | COSINE |
| 0.963 | 1.409 | 1.401 | 0.994 | 100000 | 100 | 50 | 64 | 250 | 8 bits | 5028 | 8.75 | 11431.18 | 12.95 | 1 | 372.84 | 367.737 | 74.768 | HNSW | COSINE |
| 0.939 | 9.524 | 9.516 | 0.999 | 100000 | 100 | 50 | 64 | 250 | 8 bits | 3525 | 886.28 | 112.83 | 0.01 | 1 | 392.79 | 367.737 | 74.768 | JVECTOR | COSINE |
| 0.899 | 0.967 | 0.959 | 0.992 | 100000 | 100 | 50 | 64 | 250 | 4 bits | 5076 | 8.84 | 11314.78 | 9.07 | 1 | 335.80 | 331.116 | 38.147 | HNSW | COSINE |
| 0.937 | 3.469 | 3.457 | 0.997 | 100000 | 100 | 50 | 64 | 250 | 4 bits | 3437 | 148.70 | 672.51 | 0.01 | 1 | 356.17 | 331.116 | 38.147 | JVECTOR | COSINE |
| 0.669 | 0.681 | 0.673 | 0.988 | 100000 | 100 | 50 | 64 | 250 | 1 bits | 5895 | 8.04 | 12439.36 | 8.84 | 1 | 308.42 | 303.459 | 10.490 | HNSW | COSINE |
| 0.730 | 1.056 | 1.044 | 0.989 | 100000 | 100 | 50 | 64 | 250 | 1 bits | 2672 | 51.39 | 1945.90 | 0.01 | 1 | 328.70 | 303.459 | 10.490 | JVECTOR | COSINE |

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
