jmazanec15 commented on issue #11354:
URL: https://github.com/apache/lucene/issues/11354#issuecomment-1239961308

   I just finished the initial set of experiments. It looks like there may 
still be some issues with the implementation.
   
   ### Setup
   For the data set, I used the [sift data set](https://github.com/erikbern/ann-benchmarks/#data-sets) from ann-benchmarks. I created a small script to put the data into the correct format: 
[hdf5-dump.py](https://gist.github.com/jmazanec15/e274c7bdb88301bac7b2cf032cc33396).
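
The gist is not reproduced here. As a rough sketch of the conversion step, assuming the target format is a headerless file of little-endian float32 values with `dim` floats per vector (an assumption about what KnnGraphTester reads, not something stated above), it might look like the following. The name `write_flat_float32` is hypothetical, and reading the vectors out of the HDF5 file (for example with h5py) is left out.

```python
import struct
from typing import Iterable, Sequence


def write_flat_float32(path: str, vectors: Iterable[Sequence[float]], dim: int) -> int:
    """Write vectors back to back as little-endian float32, with no header.

    Returns the number of vectors written. The flat layout is an assumption,
    not a confirmed description of the format used in the experiments.
    """
    packer = struct.Struct(f"<{dim}f")
    count = 0
    with open(path, "wb") as out:
        for vec in vectors:
            if len(vec) != dim:
                raise ValueError(f"expected {dim} values, got {len(vec)}")
            out.write(packer.pack(*vec))
            count += 1
    return count
```

With this layout, a file holding `n` vectors of dimension 128 is exactly `n * 128 * 4` bytes, which gives a quick sanity check on the output.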
 
   
   I ran all tests on a c5.4xlarge instance. 
   
   To test merging, I uncommented these [two lines](https://github.com/apache/lucene/blob/branch_9_4/lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java#L703-L706) and set maxBufferedDocs to 10K, 100K, and 500K in turn. After building the repo, I ran the following command:
   ```
   cd lucene/core/build
   java -cp libs/lucene-core-10.0.0-SNAPSHOT.jar:classes/java/test:../../test-framework/build/classes/java/main:../../codecs/build/classes/java/main/ \
     org.apache.lucene.util.hnsw.KnnGraphTester \
     -search /home/ec2-user/merge-test/data/sift-128-euclidean.test \
     -docs /home/ec2-user/merge-test/data/sift-128-euclidean.train \
     -ndoc 1000000 -maxConn 16 -beamWidthIndex 100 -niter 100 -dim 128 \
     -reindex -metric euclidean -forceMerge > test-1 2>&1
   ```
   
   Then I grepped for the merge metrics:
   ```
   grep "msec to merge numeric" test-1
   ```
   I also recorded recall over 100 queries to see how search quality was affected.
   
   I ran each of the three configurations three times, for both control and test.
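
For reference, recall is commonly computed as the fraction of true nearest neighbors that show up in the returned results, averaged over all queries. The sketch below shows that definition; `recall_at_k` is a hypothetical helper, and KnnGraphTester's own computation may differ in detail.

```python
from typing import Sequence


def recall_at_k(results: Sequence[Sequence[int]],
                truth: Sequence[Sequence[int]],
                k: int) -> float:
    """Average fraction of the true top-k neighbors found in the top-k results."""
    if len(results) != len(truth) or not results:
        raise ValueError("results and truth must be non-empty and the same length")
    matched = 0
    for found, expected in zip(results, truth):
        # Count ids present in both top-k lists; order within the list is ignored.
        matched += len(set(found[:k]) & set(expected[:k]))
    return matched / (len(results) * k)


# Two toy queries with k = 3: the first misses one true neighbor.
example = recall_at_k([[1, 2, 3], [4, 5, 6]], [[1, 2, 9], [4, 5, 6]], k=3)
print(f"recall@3 = {example:.3f}")
```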
   
   ### Results
   #### maxBufferedDocs = 10K
   | Experiment | Time to merge (ms) | QPS | Recall |
   | ---------- | ------------------ | --- | ------ |
   | Control 1 | 611190 | 740 | 0.977 |
   | Control 2 | 621678 | 769 | 0.977 |
   | Control 3 | 619656 | 769 | 0.977 |
   | Test 1 | 691339 | 657 | 0.976 |
   | Test 2 | 760199 | 689 | 0.976 |
   | Test 3 | 685738 | 704 | 0.976 |
   
   #### maxBufferedDocs = 100K
   | Experiment | Time to merge (ms) | QPS | Recall |
   | ---------- | ------------------ | --- | ------ |
   | Control 1 | 621603 | 775 | 0.977 |
   | Control 2 | 627452 | 769 | 0.977 |
   | Control 3 | 628613 | 833 | 0.977 |
   | Test 1 | 616133 | 746 | 0.973 |
   | Test 2 | 636186 | 699 | 0.973 |
   | Test 3 | 638978 | 709 | 0.973 |
   
   #### maxBufferedDocs = 500K
   | Experiment | Time to merge (ms) | QPS | Recall |
   | ---------- | ------------------ | --- | ------ |
   | Control 1 | 671704 | 763 | 0.977 |
   | Control 2 | 643735 | 714 | 0.977 |
   | Control 3 | 639047 | 800 | 0.977 |
   | Test 1 | 398604 | 699 | 0.962 |
   | Test 2 | 409549 | 751 | 0.962 |
   | Test 3 | 370152 | 775 | 0.962 |
   
   ### Conclusions
   
   From the experiments above, initializing from an existing graph during 
merge works well when few segments are being merged (about 40% less merge time 
at 500K), but adds a cost when many segments are being merged (about 15% more 
merge time at 10K). At 100K the two are roughly the same. I need to investigate 
why this happens. 
   
   Additionally, my implementation reduces recall compared to the control, and 
the drop grows with maxBufferedDocs: 0.976, 0.973, and 0.962 versus 0.977 for 
the control. I'm going to see if I can figure out why.
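
To make the comparison concrete, the merge times from the tables can be averaged per configuration. The sketch below does that; the segment estimate is an assumption (one flushed segment per maxBufferedDocs documents and no merging before the force merge), not something measured in these runs.

```python
from statistics import mean

NUM_DOCS = 1_000_000  # value passed to -ndoc

# Merge times in ms, copied from the results tables.
MERGE_MS = {
    10_000: {"control": [611190, 621678, 619656], "test": [691339, 760199, 685738]},
    100_000: {"control": [621603, 627452, 628613], "test": [616133, 636186, 638978]},
    500_000: {"control": [671704, 643735, 639047], "test": [398604, 409549, 370152]},
}


def summarize(merge_ms):
    """Return (maxBufferedDocs, control mean, test mean, percent change) rows."""
    rows = []
    for max_buffered, runs in merge_ms.items():
        control = mean(runs["control"])
        test = mean(runs["test"])
        change_pct = 100.0 * (test - control) / control
        rows.append((max_buffered, control, test, change_pct))
    return rows


for max_buffered, control, test, change_pct in summarize(MERGE_MS):
    # Assumed segment count: one flush per maxBufferedDocs documents.
    approx_segments = NUM_DOCS // max_buffered
    print(f"maxBufferedDocs={max_buffered:>7} (~{approx_segments} segments): "
          f"control={control:.0f} ms, test={test:.0f} ms, change={change_pct:+.1f}%")
```

On the numbers above this gives changes of about +15.4%, +0.7%, and -39.7% for 10K, 100K, and 500K respectively.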
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

