jmazanec15 commented on issue #11354: URL: https://github.com/apache/lucene/issues/11354#issuecomment-1239961308
I just finished the initial set of experiments. It seems there may still be some issues with the implementation.

### Setup

For the data set, I used the [sift data set](https://github.com/erikbern/ann-benchmarks/#data-sets) from ann-benchmarks. I created a small script to put the data into the correct format: [hdf5-dump.py](https://gist.github.com/jmazanec15/e274c7bdb88301bac7b2cf032cc33396). I ran all tests on a c5.4xlarge instance.

To test merge, I uncommented these [two lines](https://github.com/apache/lucene/blob/branch_9_4/lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java#L703-L706) and varied maxBufferedDocs across 10K, 100K and 500K. I used the following command after building the repo:

```
cd lucene/core/build
java -cp libs/lucene-core-10.0.0-SNAPSHOT.jar:classes/java/test:../../test-framework/build/classes/java/main:../../codecs/build/classes/java/main/ \
  org.apache.lucene.util.hnsw.KnnGraphTester \
  -search /home/ec2-user/merge-test/data/sift-128-euclidean.test \
  -docs /home/ec2-user/merge-test/data/sift-128-euclidean.train \
  -ndoc 1000000 -maxConn 16 -beamWidthIndex 100 -niter 100 -dim 128 \
  -reindex -metric euclidean -forceMerge > test-1 2>&1
```

Then I grepped for the merge metrics:

```
cat test-1 | grep "msec to merge numeric"
```

I also captured recall on 100 queries to see how search was affected. I ran each of the 3 sets of experiments 3 times.

### Results

#### 10K

| Experiment | Time to merge (ms) | QPS | Recall |
| ---------- | ------------------ | --- | ------ |
| Control 1  | 611190             | 740 | 0.977  |
| Control 2  | 621678             | 769 | 0.977  |
| Control 3  | 619656             | 769 | 0.977  |
| Test 1     | 691339             | 657 | 0.976  |
| Test 2     | 760199             | 689 | 0.976  |
| Test 3     | 685738             | 704 | 0.976  |

#### 100K

| Experiment | Time to merge (ms) | QPS | Recall |
| ---------- | ------------------ | --- | ------ |
| Control 1  | 621603             | 775 | 0.977  |
| Control 2  | 627452             | 769 | 0.977  |
| Control 3  | 628613             | 833 | 0.977  |
| Test 1     | 616133             | 746 | 0.973  |
| Test 2     | 636186             | 699 | 0.973  |
| Test 3     | 638978             | 709 | 0.973  |

#### 500K

| Experiment | Time to merge (ms) | QPS | Recall |
| ---------- | ------------------ | --- | ------ |
| Control 1  | 671704             | 763 | 0.977  |
| Control 2  | 643735             | 714 | 0.977  |
| Control 3  | 639047             | 800 | 0.977  |
| Test 1     | 398604             | 699 | 0.962  |
| Test 2     | 409549             | 751 | 0.962  |
| Test 3     | 370152             | 775 | 0.962  |

### Conclusions

From the experiments above, it seems that initializing from a graph during merge works well when few segments are being merged (the 500K runs), but adds a cost when many segments are being merged (the 10K runs). I need to investigate why this happens.

Additionally, my implementation appears to reduce recall slightly compared to the control, and the gap widens as maxBufferedDocs grows (0.976, 0.973 and 0.962 versus 0.977). I'm going to see if I can figure out why.
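To make the comparison in the results tables easier to read, here is a minimal sketch that averages the three runs per maxBufferedDocs setting and reports the change from control to test. The numbers are copied from the tables; the names `RESULTS`, `pct_change` and `summarize` are hypothetical helpers for illustration and are not part of the benchmark setup.

```python
from statistics import mean

# (merge time in ms, QPS) for each run, copied from the results tables.
RESULTS = {
    "10K": {
        "control": [(611190, 740), (621678, 769), (619656, 769)],
        "test": [(691339, 657), (760199, 689), (685738, 704)],
    },
    "100K": {
        "control": [(621603, 775), (627452, 769), (628613, 833)],
        "test": [(616133, 746), (636186, 699), (638978, 709)],
    },
    "500K": {
        "control": [(671704, 763), (643735, 714), (639047, 800)],
        "test": [(398604, 699), (409549, 751), (370152, 775)],
    },
}


def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return 100.0 * (new - old) / old


def summarize(results):
    """Average the runs per setting and compare test against control."""
    summary = {}
    for setting, runs in results.items():
        control_merge = mean(m for m, _ in runs["control"])
        test_merge = mean(m for m, _ in runs["test"])
        control_qps = mean(q for _, q in runs["control"])
        test_qps = mean(q for _, q in runs["test"])
        summary[setting] = {
            "control_merge_ms": control_merge,
            "test_merge_ms": test_merge,
            "merge_change_pct": pct_change(test_merge, control_merge),
            "qps_change_pct": pct_change(test_qps, control_qps),
        }
    return summary


if __name__ == "__main__":
    for setting, row in summarize(RESULTS).items():
        print(
            f"{setting}: merge {row['merge_change_pct']:+.1f}%, "
            f"QPS {row['qps_change_pct']:+.1f}%"
        )
```

On these numbers, the average merge time for the test runs comes out about 15% higher at 10K, within about 1% at 100K, and about 40% lower at 500K. With only three runs per configuration, these should be read as rough indications rather than precise measurements.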