jimczi commented on PR #12551: URL: https://github.com/apache/lucene/pull/12551#issuecomment-1719529457
I made some adjustments to the formula to consider the logarithmic complexity of the greedy search. I conducted tests on two datasets: 1. The standard SIFT dataset, which has 128 dimensions and 1 million documents. 2. The Cohere Wikipedia dataset, where I used the first 1 million documents and randomly selected 100 queries from the rest of the dataset. I compared the performance between two different indexing strategies: - One using forced merging with a single segment. - Another using the default merging strategy, starting with a writer buffer size of 1000 documents. SIFT results: | Method | k | ef | recall | # comparisons | | ------------- | ------------- | ------------- |------------- |------------- | | Single Segment|10|100|0.973000|1361| |Multiple Segments|10|100|0.999000|27278| |Multiple Segments with Dynamic Strategy|10|100|0.987000|10137| |Single Segment|100|1000|0.987700|7926| |Multiple Segments|100|1000|0.999500|146868| |Multiple Segments with Dynamic Strategy|100|1000|0.999200|58434| |Single Segment|1000|10000|0.977250|63692| |Multiple Segments|1000|10000|0.999080|526198| |Multiple Segments with Dynamic Strategy|1000|10000|0.998770|292739| Cohere results: | Method | k | ef | recall | # comparisons | | ------------- | ------------- | ------------- |------------- |------------- | |Single Segment|10|100|0.947000|1920| |Multiple Segments|10|100|0.996000|45096| |Multiple Segments with Dynamic Strategy|10|100|0.980000|18565| |Single Segment|100|1000|0.977800|12848| |Multiple Segments|100|1000|0.999200|199685| |Multiple Segments with Dynamic Strategy|100|1000|0.997900|96321| |Single Segment|1000|10000|0.992530|81033 |Multiple Segments|1000|10000|0.999900|584438 |Multiple Segments with Dynamic Strategy|1000|10000|0.999700|407804 We already knew that searching across multiple segments improves recall but comes with a higher computational cost. For example, with 37 segments, searching with a size of 100 requires 23 times more vector comparisons compared to the single-segment version on the Cohere dataset. While the recall is significantly better, it may be challenging for users to strike the right balance between performance and recall since the number of segments can vary considerably during indexing. This is where using the segment size to determine the optimal queue size seems like a reasonable compromise. For the Cohere dataset, it reduced the number of comparisons by a factor of 2.5 (for the 10-100 case) while still maintaining better recall than the single-segment case. Similar improvements were observed with the SIFT dataset. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org