jimczi commented on PR #12551:
URL: https://github.com/apache/lucene/pull/12551#issuecomment-1719529457

   I made some adjustments to the formula to consider the logarithmic 
complexity of the greedy search. I conducted tests on two datasets: 
   
   1. The standard SIFT dataset, which has 128 dimensions and 1 million 
documents.
   2. The Cohere Wikipedia dataset, where I used the first 1 million documents 
and randomly selected 100 queries from the rest of the dataset.
   
   I compared the performance between two different indexing strategies:
   
   - One using forced merging with a single segment.
   - Another using the default merging strategy, starting with a writer buffer 
size of 1000 documents.
   
   SIFT results:
   
   | Method  | k | ef | recall | # comparisons |
   | ------------- | ------------- | ------------- |------------- 
|------------- |
   | Single Segment|10|100|0.973000|1361|
   |Multiple Segments|10|100|0.999000|27278|
   |Multiple Segments with Dynamic Strategy|10|100|0.987000|10137|
   |Single Segment|100|1000|0.987700|7926|
   |Multiple Segments|100|1000|0.999500|146868|
   |Multiple Segments with Dynamic Strategy|100|1000|0.999200|58434|
   |Single Segment|1000|10000|0.977250|63692|
   |Multiple Segments|1000|10000|0.999080|526198|
   |Multiple Segments with Dynamic Strategy|1000|10000|0.998770|292739|
   
   Cohere results:
   
   | Method  | k | ef | recall | # comparisons |
   | ------------- | ------------- | ------------- |------------- 
|------------- |
   |Single Segment|10|100|0.947000|1920|
   |Multiple Segments|10|100|0.996000|45096|
   |Multiple Segments with Dynamic Strategy|10|100|0.980000|18565|
   |Single Segment|100|1000|0.977800|12848|
   |Multiple Segments|100|1000|0.999200|199685|
   |Multiple Segments with Dynamic Strategy|100|1000|0.997900|96321|
   |Single Segment|1000|10000|0.992530|81033
   |Multiple Segments|1000|10000|0.999900|584438
   |Multiple Segments with Dynamic Strategy|1000|10000|0.999700|407804
   
   We already knew that searching across multiple segments improves recall but 
comes with a higher computational cost. For example, with 37 segments, 
searching with a size of 100 requires 23 times more vector comparisons compared 
to the single-segment version on the Cohere dataset. While the recall is 
significantly better, it may be challenging for users to strike the right 
balance between performance and recall since the number of segments can vary 
considerably during indexing.
   
   This is where using the segment size to determine the optimal queue size 
seems like a reasonable compromise. For the Cohere dataset, it reduced the 
number of comparisons by a factor of 2.5 (for the 10-100 case) while still 
maintaining better recall than the single-segment case. Similar improvements 
were observed with the SIFT dataset.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to