abernardi597 commented on PR #15472: URL: https://github.com/apache/lucene/pull/15472#issuecomment-3647265760
> Lucene's HNSW merging has exactly this optimization

I've been working on some modifications to further align the two implementations. For example, I have added changes to do single-threaded graph construction on the indexing thread (instead of buffering all the docs until building the graph in parallel at flush time). I am still working on this graph-reuse bit, though it looks like Lucene also does a smart merge where it inserts key nodes from the smaller graph so that it can re-use adjacency information from the small graph to seed the graph search when inserting the remaining nodes. JVector does not do this at the moment, but would likely benefit from such a change (possibly as an upstream contribution).

> how does your PR here compare to that original PR?

I looked at the original PR as a starting point, but found that there were several key changes in the upstream OpenSearch implementation that could be brought in. Merging those commits seemed unwieldy, so I opted to start from scratch by [checking out the codec](https://github.com/apache/lucene/pull/15472/commits/7843b82f6cdc1f3075a2ff5f489039f20c2c7c0a) into the `sandbox`. I then fixed the build and style issues before making some changes to how the codec actually works, to get closer to functional parity with Lucene's HNSW codecs: specifically, getting the extra KNN tests passing and moving towards the single-indexing-thread model mentioned above.

> Does Lucene's HNSW have an analogue for neighborOverflow=2 and alpha=2

We have found that `alpha=2` is actually partially responsible for the increase in index/search time. `alpha > 1` is a hyper-parameter that relaxes the diversity check by a multiplicative factor, with `alpha=1` being the same diversity check as HNSW. We found that `alpha=2` resulted in graphs where every node was saturated with edges (`maxConn` edges), which was really slowing down both construction and search.
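To make the `alpha` effect concrete, here is a minimal sketch of an alpha-relaxed diversity prune in the style of Vamana/DiskANN-family graphs. This is not JVector's actual API; the class, `DistanceFn`, and `prune` are made up for illustration, and candidates are assumed to arrive sorted by distance to the node being inserted:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch of an alpha-relaxed diversity check (hypothetical
 * API, not JVector's). With alpha = 1 this reproduces HNSW's diversity
 * heuristic; with alpha = 2 far fewer candidates are pruned, so nodes
 * tend to saturate at maxConn edges.
 */
final class AlphaDiversityPrune {
  interface DistanceFn { double distance(int a, int b); }

  static List<Integer> prune(int node, List<Integer> sortedCandidates,
                             int maxConn, double alpha, DistanceFn dist) {
    List<Integer> selected = new ArrayList<>();
    for (int cand : sortedCandidates) {
      if (selected.size() >= maxConn) break;
      boolean diverse = true;
      for (int s : selected) {
        // Prune cand when an already-selected neighbor s is (alpha times)
        // closer to cand than the node itself is: cand adds little
        // diversity because it is reachable through s.
        if (alpha * dist.distance(s, cand) <= dist.distance(node, cand)) {
          diverse = false;
          break;
        }
      }
      if (diverse) selected.add(cand);
    }
    return selected;
  }
}
```

On a 1-D toy metric, pruning candidates `{1, 3, 100}` around node `0` keeps only `{1}` at `alpha=1` but keeps `{1, 3}` at `alpha=2`, which is the edge-saturation effect described above.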
There is also a `hierarchyEnabled` flag that adds layers to the graph in much the same fashion as the "H" in HNSW. Enabling the hierarchy with `alpha=1`, and also allowing `2*maxConn` edges at `level=0`, gives somewhat more promising results:

recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | visited | index(s) | index_doc/s | force_merge(s) | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType | metric
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0.926 | 2.262 | 2.2 | 0.973 | 100000 | 100 | 50 | 64 | 250 | no | 3277 | 12.07 | 8282.26 | 0.01 | 1 | 319.21 | 292.969 | 292.969 | JVECTOR | COSINE
0.926 | 10.106 | 9.95 | 0.985 | 100000 | 100 | 50 | 64 | 250 | 8 bits | 3238 | 196.74 | 508.27 | 0.01 | 1 | 393.46 | 367.737 | 74.768 | JVECTOR | COSINE
0.926 | 3.444 | 3.386 | 0.983 | 100000 | 100 | 50 | 64 | 250 | 4 bits | 3189 | 75.32 | 1327.6 | 0.01 | 1 | 356.71 | 331.116 | 38.147 | JVECTOR | COSINE
0.739 | 1.15 | 1.122 | 0.976 | 100000 | 100 | 50 | 64 | 250 | 1 bits | 2581 | 22.24 | 4496.4 | 0.01 | 1 | 329.15 | 303.459 | 10.49 | JVECTOR | COSINE

Combining this with the single-threaded indexing mentioned above lets me run a more apples-to-apples test, with 32 indexing threads and 32 merge threads plus a final force-merge for both codecs:

recall | latency(ms) | netCPU | avgCpuCount | numDoc | topK | fanout | maxConn | beamWidth | quantized | visited | index(s) | index_doc/s | force_merge(s) | total_index(s) | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0.960 | 1.796 | 1.764 | 0.982 | 200000 | 100 | 50 | 64 | 250 | no | 5596 | 12.81 | 15612.80 | 24.58 | 37.39 | 1 | 596.91 | 585.938 | 585.938 | HNSW
0.904 | 2.416 | 2.371 | 0.981 | 200000 | 100 | 50 | 64 | 250 | no | 3321 | 15.42 | 12972.69 | 73.05 | 88.47 | 1 | 686.89 | 585.938 | 585.938 | JVECTOR
0.894 | 1.391 | 1.363 | 0.980 | 200000 | 100 | 50 | 64 | 250 | 4 bits | 5661 | 18.55 | 10784.00 | 21.93 | 40.48 | 1 | 672.01 | 662.231 | 76.294 | HNSW
0.903 | 3.923 | 3.862 | 0.984 | 200000 | 100 | 50 | 64 | 250 | 4 bits | 3274 | 15.56 | 12850.99 | 107.83 | 123.39 | 1 | 760.87 | 662.231 | 76.294 | JVECTOR
0.661 | 0.887 | 0.867 | 0.977 | 200000 | 100 | 50 | 64 | 250 | 1 bits | 6552 | 17.25 | 11594.20 | 19.71 | 36.96 | 1 | 617.32 | 606.918 | 20.981 | HNSW
0.724 | 1.252 | 1.229 | 0.982 | 200000 | 100 | 50 | 64 | 250 | 1 bits | 2704 | 15.52 | 12888.26 | 35.89 | 51.41 | 1 | 705.95 | 606.918 | 20.981 | JVECTOR

I'm nearly at a point where I can re-use the largest graph at merge time, but I'm working through an elusive duplicate-neighbor bug.

> making it easy-ish for anyone to benchmark jVector against Faiss wrapped Codec and Lucene's HNSW implementation would be awesome

Apologies for the delay here; I am working on re-applying my changes on top of these awesome improvements!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
