Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

via GitHub Tue, 12 Mar 2024 23:41:52 -0700


antonha commented on PR #13149:
URL: https://github.com/apache/lucene/pull/13149#issuecomment-1993673162


   I spent some time "proving" this in luceneutil: 
[luceneutil/pull/257](https://github.com/mikemccand/luceneutil/pull/257) adds a 
reproduction. It needs to be run with `wikimediumall`, with `optimize = True` 
for indexing and `commitPoint = 'single'`. 
   
   The single large segment is needed because Lucene otherwise chooses to 
store document ids in the BKD leaves as int24, in which case the optimization 
in this PR does nothing. With the luceneutil changes, I get the following 
output when comparing this PR to master:
   
   ``` 
                               Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                       OrHighNotLow      527.76      (7.9%)      516.38      (8.0%)   -2.2% ( -16% -   14%) 0.392
                       OrHighNotMed      523.37      (7.3%)      512.63      (7.5%)   -2.1% ( -15% -   13%) 0.384
                      OrHighNotHigh      418.72      (6.4%)      411.39      (6.3%)   -1.8% ( -13% -   11%) 0.384
              BrowseMonthTaxoFacets       11.50     (22.5%)       11.32     (26.2%)   -1.6% ( -41% -   60%) 0.839
                      OrNotHighHigh      419.16      (5.4%)      412.88      (5.4%)   -1.5% ( -11% -    9%) 0.381
                        AndHighHigh       43.95      (5.4%)       43.42      (3.8%)   -1.2% (  -9% -    8%) 0.413
               HighIntervalsOrdered        4.60      (5.6%)        4.55      (6.0%)   -1.1% ( -12% -   11%) 0.561
                  HighTermMonthSort    13015.73      (3.3%)    12889.39      (4.0%)   -1.0% (  -7% -    6%) 0.401
                             Fuzzy1      285.37      (3.3%)      282.87      (2.2%)   -0.9% (  -6% -    4%) 0.320
           AndHighHighDayTaxoFacets        9.00      (4.6%)        8.92      (4.8%)   -0.8% (  -9% -    8%) 0.577
                       HighSpanNear        4.59      (2.8%)        4.55      (2.8%)   -0.8% (  -6% -    4%) 0.356
                       OrNotHighLow     1091.32      (2.8%)     1082.56      (3.3%)   -0.8% (  -6% -    5%) 0.402
                        LowSpanNear       12.72      (2.4%)       12.64      (2.5%)   -0.7% (  -5% -    4%) 0.377
               HighTermTitleBDVSort        5.69      (2.4%)        5.65      (2.5%)   -0.7% (  -5% -    4%) 0.387
                             Fuzzy2      165.14      (2.4%)      164.10      (1.8%)   -0.6% (  -4% -    3%) 0.343
                        MedSpanNear       12.61      (1.9%)       12.53      (2.2%)   -0.6% (  -4% -    3%) 0.327
                           HighTerm      578.67      (6.5%)      575.09      (5.1%)   -0.6% ( -11% -   11%) 0.739
               MedTermDayTaxoFacets       19.91      (5.1%)       19.79      (4.3%)   -0.6% (  -9% -    9%) 0.683
             OrHighMedDayTaxoFacets        3.43     (12.0%)        3.41     (11.4%)   -0.5% ( -21% -   25%) 0.885
                            MedTerm      748.03      (5.6%)      744.26      (4.5%)   -0.5% ( -10% -   10%) 0.755
            AndHighMedDayTaxoFacets       32.72      (1.5%)       32.57      (1.6%)   -0.5% (  -3% -    2%) 0.335
                            LowTerm      526.39      (2.7%)      524.40      (2.7%)   -0.4% (  -5% -    5%) 0.653
                   HighSloppyPhrase        9.56      (2.3%)        9.53      (2.9%)   -0.3% (  -5% -    5%) 0.689
                         HighPhrase       33.59      (3.8%)       33.49      (3.8%)   -0.3% (  -7% -    7%) 0.803
                MedIntervalsOrdered       14.95      (4.2%)       14.92      (4.4%)   -0.2% (  -8% -    8%) 0.876
          BrowseDayOfYearSSDVFacets        6.37      (1.4%)        6.36      (1.3%)   -0.2% (  -2% -    2%) 0.665
                    MedSloppyPhrase       15.68      (1.6%)       15.65      (2.8%)   -0.2% (  -4% -    4%) 0.799
                         AndHighMed      109.07      (3.9%)      108.94      (3.2%)   -0.1% (  -6% -    7%) 0.915
                    LowSloppyPhrase       12.64      (1.5%)       12.63      (2.2%)   -0.1% (  -3% -    3%) 0.855
                       OrNotHighMed      377.59      (3.0%)      377.64      (3.3%)    0.0% (  -6% -    6%) 0.989
              BrowseMonthSSDVFacets        6.65      (0.4%)        6.65      (0.4%)    0.0% (   0% -    0%) 0.829
                          OrHighLow      541.18      (3.3%)      541.47      (3.3%)    0.1% (  -6% -    6%) 0.960
                LowIntervalsOrdered       12.42      (4.0%)       12.43      (3.9%)    0.1% (  -7% -    8%) 0.948
        BrowseRandomLabelSSDVFacets        5.35      (1.3%)        5.36      (1.1%)    0.1% (  -2% -    2%) 0.809
                          MedPhrase       47.26      (2.3%)       47.31      (2.4%)    0.1% (  -4% -    4%) 0.889
                            Prefix3      246.21      (4.9%)      246.48      (7.3%)    0.1% ( -11% -   12%) 0.956
                          LowPhrase       34.24      (2.5%)       34.30      (2.4%)    0.2% (  -4% -    5%) 0.824
                  HighTermTitleSort       98.67      (2.9%)       98.84      (4.6%)    0.2% (  -7% -    7%) 0.884
                         OrHighHigh       42.99      (8.8%)       43.10      (8.6%)    0.3% ( -15% -   19%) 0.924
               BrowseDateSSDVFacets        2.24     (13.7%)        2.25     (13.2%)    0.6% ( -23% -   31%) 0.895
                            Respell      255.75      (1.7%)      257.22      (1.7%)    0.6% (  -2% -    4%) 0.281
          BrowseDayOfYearTaxoFacets        7.39      (1.9%)        7.44      (4.1%)    0.7% (  -5% -    6%) 0.489
                         AndHighLow      961.38      (3.8%)      970.95      (4.1%)    1.0% (  -6% -    9%) 0.427
               BrowseDateTaxoFacets        7.08      (1.3%)        7.16      (4.6%)    1.1% (  -4% -    7%) 0.291
                         TermDTSort       77.89      (7.4%)       78.84      (4.8%)    1.2% ( -10% -   14%) 0.535
                           PKLookup      250.25      (3.3%)      253.54      (2.7%)    1.3% (  -4% -    7%) 0.169
              HighTermDayOfYearSort      107.64      (2.6%)      109.12      (2.1%)    1.4% (  -3% -    6%) 0.064
                          OrHighMed      109.62      (6.2%)      111.24      (6.1%)    1.5% ( -10% -   14%) 0.447
        BrowseRandomLabelTaxoFacets        6.17      (5.0%)        6.27      (3.9%)    1.6% (  -6% -   11%) 0.253
                           Wildcard      183.93      (3.7%)      187.03      (4.4%)    1.7% (  -6% -   10%) 0.193
                            LongNRQ       42.83      (6.1%)       83.47     (34.5%)   94.9% (  51% -  144%) 0.000
                             IntNRQ       20.29      (3.4%)       53.24     (34.2%)  162.4% ( 120% -  207%) 0.000
   ```
   
   The interesting part here is the bottom two rows: IntNRQ and LongNRQ 
become much faster. It might be that I messed up the "minimal" part of the 
reproduction; maybe all that was needed is the single segment and the 
`taskCountPerCat` increase.
   
   Regardless, it looks promising: a 94.9% (LongNRQ) to 162.4% (IntNRQ) 
increase in QPS for range queries with this PR in the slightly modified 
benchmark.
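
   As a sanity check on the two range-query rows, the reported improvement can be recomputed from the mean QPS columns. This is a minimal sketch that assumes the `Pct diff` column is the relative change of mean QPS against the baseline; the helper name `pct_diff` is made up for illustration and is not part of luceneutil.

```python
def pct_diff(baseline_qps: float, candidate_qps: float) -> float:
    """Relative change of candidate QPS versus baseline QPS, in percent."""
    return 100.0 * (candidate_qps - baseline_qps) / baseline_qps


# Mean QPS values copied from the LongNRQ and IntNRQ rows of the table.
rows = {
    "LongNRQ": (42.83, 83.47),
    "IntNRQ": (20.29, 53.24),
}

for task, (baseline, candidate) in rows.items():
    print(f"{task}: {pct_diff(baseline, candidate):.1f}%")
```

   Under that assumption, the values come out at about 94.9% for LongNRQ and 162.4% for IntNRQ, matching the table.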

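   To illustrate why the single large segment matters for the int24 point: a 24-bit encoding can only represent document ids up to 2^24 - 1 = 16,777,215, so only a segment with more documents than that can contain leaves that need a wider encoding. The snippet below is a simplified sketch of that threshold only. It is not Lucene's `DocIdsWriter`, which has more encodings than shown here, and the function name `bits_per_doc_id` is hypothetical.

```python
INT24_MAX = 0xFFFFFF  # largest doc id that fits in 24 bits (16_777_215)


def bits_per_doc_id(leaf_doc_ids: list[int]) -> int:
    """Return 24 if every doc id in the leaf fits in 24 bits, else 32.

    Simplified illustration: only the int24-vs-int32 choice is modelled.
    """
    if not leaf_doc_ids:
        raise ValueError("leaf must contain at least one doc id")
    return 24 if max(leaf_doc_ids) <= INT24_MAX else 32


# Illustrative doc ids, not taken from any real index.
small_segment_leaf = [12, 40_000, 9_000_000]
large_segment_leaf = [12, 40_000, 20_000_000]

print(bits_per_doc_id(small_segment_leaf))
print(bits_per_doc_id(large_segment_leaf))
```

   With many small segments, every doc id stays below the threshold and the sketch always picks 24 bits, which is the case where the PR's optimization has nothing to do.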

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

