jpountz commented on PR #12489:
URL: https://github.com/apache/lucene/pull/12489#issuecomment-1685250650

   I think it's starting to look better now. I fixed some inefficiencies 
and applied some of the optimizations suggested by Mackenzie et al. in 
["Tradeoff Options for Bipartite Graph Partitioning"](https://assets.amazon.science/a2/af/ae2fd9c14e6dbf75838160c37a34/tradeoff-options-for-bipartite-graph-partitioning.pdf):
    - Use a simplified estimator that only requires two log computations.
    - Use simulated annealing to stop iterating when the gain would be small, with the iteration number as the threshold.
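
   To illustrate the second bullet, here is a minimal sketch of what "using the iteration number as a threshold" could look like. It is not the code from this PR: the class name, method name, and array layout are hypothetical, and it assumes each doc already has an estimated gain for moving to the other partition. The only idea it shows is that a swap must beat `iter` rather than zero, so later iterations accept fewer swaps and stop sooner.

   ```java
   import java.util.Arrays;

   /** Hypothetical illustration only; names and structure are assumptions. */
   final class AnnealedSwapSketch {

     /**
      * Counts how many (left, right) doc pairs would be swapped in iteration
      * {@code iter}. Each array holds the estimated gain of moving a doc to
      * the other side. Pairs are taken in order of decreasing gain, and the
      * loop stops as soon as the combined gain does not exceed {@code iter}.
      */
     static int countSwaps(float[] leftGains, float[] rightGains, int iter) {
       float[] left = leftGains.clone();
       float[] right = rightGains.clone();
       Arrays.sort(left);  // ascending, so the best candidates are at the end
       Arrays.sort(right);
       int pairs = Math.min(left.length, right.length);
       int swaps = 0;
       for (int i = 0; i < pairs; i++) {
         float combined = left[left.length - 1 - i] + right[right.length - 1 - i];
         if (combined <= iter) {
           break; // remaining pairs have even smaller gains
         }
         swaps++;
       }
       return swaps;
     }

     public static void main(String[] args) {
       float[] left = {5f, 3f, 1f};
       float[] right = {4f, 2f, -1f};
       for (int iter = 0; iter < 10; iter += 3) {
         System.out.println("iter=" + iter + " swaps=" + countSwaps(left, right, iter));
       }
     }
   }
   ```

   With a threshold of zero, every pair with a positive combined gain is swapped. As `iter` grows, only the most beneficial swaps survive, and an iteration that finds none can end the loop early.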
   
   With the suggested defaults of minDocFreq=4,096 and minPartitionSize=32, I'm 
getting the following performance numbers on wikimedium10m (10M docs):
    - indexing (24 threads): 6.5 minutes
    - force-merging (single thread): 4.2 minutes
    - reordering doc IDs, including building a forward index by uninverting the inverted index (24 threads): 5.6 minutes
    - serializing the reordered view via addIndexes (single thread): 7.4 minutes
   
   Comparing query performance then gives interesting results. I had to 
disable verification of scores and counts because of the reordering, but a quick 
manual check suggests that results are valid. I can guess why some queries like 
conjunctions are faster, but I'm not sure why `OrHighLow` and `HighPhrase` are slower. 
The performance of sorting tasks is highly dependent on the index 
order, so I'm treating them as noise.
   
   ```
                       Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                  OrHighLow      583.26      (5.9%)      400.94      (5.9%)  -31.3% ( -40% -  -20%) 0.000
                 HighPhrase       28.64      (7.6%)       20.78      (5.0%)  -27.4% ( -37% -  -16%) 0.000
                 TermDTSort      113.57      (2.5%)       90.23      (1.2%)  -20.5% ( -23% -  -17%) 0.000
          HighTermTitleSort       72.11      (1.5%)       63.72      (1.4%)  -11.6% ( -14% -   -8%) 0.000
                   PKLookup      290.91      (3.5%)      264.17      (3.0%)   -9.2% ( -15% -   -2%) 0.000
                   HighTerm      634.93      (6.4%)      584.08      (5.9%)   -8.0% ( -19% -    4%) 0.000
                     IntNRQ       54.77     (16.8%)       50.44     (13.2%)   -7.9% ( -32% -   26%) 0.098
          HighTermMonthSort     7652.28      (2.8%)     7294.07      (3.2%)   -4.7% ( -10% -    1%) 0.000
               OrHighNotLow      600.38      (5.5%)      600.98      (5.3%)    0.1% ( -10% -   11%) 0.953
                    Respell      272.80      (2.2%)      274.82      (2.3%)    0.7% (  -3% -    5%) 0.301
                   Wildcard      157.34      (4.4%)      160.16      (3.9%)    1.8% (  -6% -   10%) 0.172
           HighSloppyPhrase       19.99      (5.5%)       20.74      (4.3%)    3.8% (  -5% -   14%) 0.016
                    Prefix3      840.82      (4.7%)      882.90      (5.8%)    5.0% (  -5% -   16%) 0.002
                     Fuzzy1      361.19      (2.8%)      383.68      (3.7%)    6.2% (   0% -   13%) 0.000
       HighIntervalsOrdered        8.51      (4.9%)        9.10      (4.3%)    6.9% (  -2% -   16%) 0.000
              OrHighNotHigh      408.11      (4.8%)      440.27      (5.1%)    7.9% (  -1% -   18%) 0.000
               HighSpanNear       23.57      (3.2%)       25.52      (3.7%)    8.3% (   1% -   15%) 0.000
              OrNotHighHigh      367.13      (4.2%)      397.89      (4.2%)    8.4% (   0% -   17%) 0.000
                     Fuzzy2      188.81      (2.1%)      204.96      (2.6%)    8.6% (   3% -   13%) 0.000
        MedIntervalsOrdered       28.61      (4.8%)       31.32      (4.3%)    9.5% (   0% -   19%) 0.000
                LowSpanNear       51.15      (3.3%)       56.30      (2.6%)   10.1% (   4% -   16%) 0.000
                MedSpanNear       46.95      (3.1%)       51.69      (2.9%)   10.1% (   3% -   16%) 0.000
            MedSloppyPhrase       59.70      (5.0%)       66.11      (4.1%)   10.7% (   1% -   20%) 0.000
               OrHighNotMed      514.44      (5.3%)      577.94      (5.7%)   12.3% (   1% -   24%) 0.000
        LowIntervalsOrdered       78.83      (3.8%)       88.74      (3.7%)   12.6% (   4% -   20%) 0.000
            LowSloppyPhrase       54.64      (4.4%)       62.23      (3.6%)   13.9% (   5% -   22%) 0.000
                 OrHighHigh       60.79      (8.6%)       69.68      (9.4%)   14.6% (  -3% -   35%) 0.000
                    MedTerm      864.76      (5.3%)     1024.07      (7.0%)   18.4% (   5% -   32%) 0.000
                  LowPhrase       80.72      (4.5%)       97.66      (5.1%)   21.0% (  10% -   32%) 0.000
                  MedPhrase       49.16      (4.5%)       60.54      (5.1%)   23.1% (  12% -   34%) 0.000
                 AndHighMed      183.79      (4.8%)      238.88      (6.2%)   30.0% (  18% -   42%) 0.000
                AndHighHigh       88.11      (5.8%)      114.97      (7.0%)   30.5% (  16% -   46%) 0.000
               OrNotHighMed      462.38      (3.5%)      606.15      (4.7%)   31.1% (  22% -   40%) 0.000
       HighTermTitleBDVSort       27.04      (1.7%)       35.47      (5.8%)   31.2% (  23% -   39%) 0.000
               OrNotHighLow     1666.47      (3.9%)     2265.52      (3.6%)   35.9% (  27% -   45%) 0.000
                  OrHighMed      170.06      (4.6%)      242.18      (8.7%)   42.4% (  27% -   58%) 0.000
                    LowTerm     1032.01      (4.4%)     1520.01      (7.2%)   47.3% (  34% -   61%) 0.000
                 AndHighLow     1577.56      (2.9%)     2337.64      (6.8%)   48.2% (  37% -   59%) 0.000
      HighTermDayOfYearSort      199.31      (2.3%)      357.90      (2.9%)   79.6% (  72% -   86%) 0.000
   ```
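
   For reading the table: the `Pct diff` values are consistent with the relative change in mean QPS from baseline to the modified version. The sketch below assumes that interpretation; the class and method names are made up for illustration and are not part of the benchmark tooling. It does not reproduce the bracketed range or the p-value.

   ```java
   /** Hypothetical helper for sanity-checking the table; not benchmark code. */
   final class PctDiffSketch {

     /** Relative change of candidate QPS versus baseline QPS, in percent. */
     static double pctDiff(double baselineQps, double candidateQps) {
       return 100.0 * (candidateQps - baselineQps) / baselineQps;
     }

     public static void main(String[] args) {
       // OrHighLow row: 583.26 -> 400.94, reported as -31.3%
       System.out.printf("OrHighLow: %.1f%%%n", pctDiff(583.26, 400.94));
       // HighTermDayOfYearSort row: 199.31 -> 357.90, reported as 79.6%
       System.out.printf("HighTermDayOfYearSort: %.1f%%%n", pctDiff(199.31, 357.90));
     }
   }
   ```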


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

