[ https://issues.apache.org/jira/browse/LUCENE-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558093#comment-17558093 ]
Alessandro Benedetti commented on LUCENE-10593: ----------------------------------------------- Hi @msokolov @mayya-sharipova and @jtibshirani , I have finally finished my performance tests. Initially the results were worse in this branch, I found that suspicious as I expected the removal of the BoundChecker and the removal of the reverse mechanism to outweigh the additional division in the distance measure during graph building and searching. After a deep investigation I found the culprit (you see it in the latest commit). {code:java} if (neighborSimilarity >= score) { if ((neighborSimilarity < score) == false) { // this version improves the performance dramatically in both indexing/searching {code} After that fix, the results are very encouraging. There are strong speedup for both angular and euclidean distances, both for indexing and searching. *If this is validated we are getting a great cleanup of the code and also a nice performance boost.* I'll have my colleague @eliaporciani to repeat the tests on Apple M1. The following tests were executed on Intellij running the org.apache.lucene.util.hnsw.KnnGraphTester. 2.4 GHz 8-Core Intel Core i9 - 32 GB 2667 MHz DDR4 {noformat} `INDEXING EUCLIDEAN -beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5 -metric euclidean ORIGINAL IW 0 [2022-06-22T14:00:12.647030Z; main]: 64335 msec to write vectors IW 0 [2022-06-22T14:01:57.425108Z; main]: 65710 msec to write vectors IW 0 [2022-06-22T14:03:18.052900Z; main]: 64817 msec to write vectors THIS BRANCH IW 0 [2022-06-22T14:04:50.683607Z; main]: 6597 msec to write vectors IW 0 [2022-06-22T14:05:34.090801Z; main]: 6687 msec to write vectors IW 0 [2022-06-22T14:06:00.268309Z; main]: 6564 msec to write vectors INDEXING ANGULAR -beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5 -metric angular ORIGINAL IW 0 [2022-06-22T13:55:45.401310Z; main]: 32897 msec to write vectors IW 0 [2022-06-22T13:56:39.737642Z; main]: 33255 msec to write vectors IW 0 [2022-06-22T13:57:31.172709Z; main]: 32576 msec to write vectors THIS BRANCH IW 0 [2022-06-22T13:52:06.085790Z; main]: 25261 msec to write vectors IW 0 [2022-06-22T13:52:51.022766Z; main]: 25775 msec to write vectors IW 0 [2022-06-22T13:53:47.565833Z; main]: 24523 msec to write vectors` `SEARCH EUCLIDEAN -niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5 -search /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5 -metric euclidean ORIGINAL completed 500 searches in 1026 ms: 487 QPS CPU time=1025ms completed 500 searches in 1030 ms: 485 QPS CPU time=1029ms completed 500 searches in 1031 ms: 484 QPS CPU time=1030ms THIS BRANCH completed 500 searches in 46 ms: 10869 QPS CPU time=46ms completed 500 searches in 46 ms: 10869 QPS CPU time=46ms completed 500 searches in 47 ms: 10638 QPS CPU time=46ms SEARCH ANGULAR -niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5 -search /Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5 -metric angular ORIGINAL completed 500 searches in 154 ms: 3246 QPS CPU time=153ms completed 500 searches in 162 ms: 3086 QPS CPU time=162ms completed 500 searches in 166 ms: 3012 QPS CPU time=166ms THIS BRANCH completed 500 searches in 62 ms: 8064 QPS CPU time=62ms completed 500 searches in 65 ms: 7692 QPS CPU time=65ms completed 500 searches in 63 ms: 7936 QPS CPU time=62ms ` {noformat} Please correct me in case I did anything wrong, it's the first time I was using the org.apache.lucene.util.hnsw.KnnGraphTester > VectorSimilarityFunction reverse removal > ---------------------------------------- > > Key: LUCENE-10593 > URL: https://issues.apache.org/jira/browse/LUCENE-10593 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Alessandro Benedetti > Priority: Major > Labels: vector-based-search > > org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN similarity behaves > in an opposite way in comparison to the other similarities: > A higher similarity score means higher distance, for this reason, has been > marked with "reversed" and a function is present to map from the similarity > to a score (where higher means closer, like in all other similarities.) > Having this counterintuitive behavior with no apparent explanation I could > find(please correct me if I am wrong) brings a lot of nasty side effects for > the code readability, especially when combined with the NeighbourQueue that > has a "reversed" itself. > In addition, it complicates also the usage of the pattern: > Result Queue -> MIN HEAP > Candidate Queue -> MAX HEAP > In HNSW searchers. > The proposal in my Pull Request aims to: > 1) the Euclidean similarity just returns the score, in line with the other > similarities, with the formula currently used to move from distance to score > 2) simplify the code, removing the bound checker that's not necessary anymore > 3) refactor here and there to be in line with the simplification > 4) refactor of NeighborQueue to clearly state when it's a MIN_HEAP or > MAX_HEAP, now debugging is much easier and understanding the HNSW code is > much more intuitive -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org