[GitHub] [lucene] gsmiller commented on pull request #11928: GH#11922: Allow DisjunctionDISIApproximation to short-circuit

GitBox Tue, 29 Nov 2022 12:54:07 -0800


gsmiller commented on PR #11928:
URL: https://github.com/apache/lucene/pull/11928#issuecomment-1331288870


   @jpountz I re-ran some internal benchmarking with this change to highlight 
the speedup in cases where scoring isn't needed (at least some specific 
use-cases I'm looking at). These use-cases all involve a "disjunction filter," 
meaning a disjunction of terms that is used as a required clause. So something 
like `(+ (foo:bar foo:baz foo:zed) (...))`, where the `foo` field must take on 
one of the specified values to be considered a candidate match. To provide a 
sense of scale, on average, these filters have 40 different terms in them. 
Since these "filters" don't participate in scoring at all, it's a good 
candidate for this short-circuiting.
   
   In these benchmarks, I'm observing a 2.3% QPS improvement, and a 3.5% avg. 
latency reduction (5.9% p50 reduction / 3.5% p99 reduction). So the change 
appears to be helping this type of situation.
   
   As for whether-or-not this change would actually hurt other common use-cases 
that require scoring or second-phase checks, I re-ran `luceneutil` benchmarks 
(wikimedium10m) task and don't observe any regressions there (results below). 
It's possible there's a gap in our benchmarks though, and maybe there are some 
common use-cases not covered?
   
   ```
                               TaskQPS baseline      StdDevQPS candidate      
StdDev                Pct diff p-value
                    MedSloppyPhrase      115.00      (4.9%)      112.91      
(5.1%)   -1.8% ( -11% -    8%) 0.249
                  HighTermTitleSort      242.48      (3.2%)      238.66      
(4.3%)   -1.6% (  -8% -    6%) 0.189
                   HighSloppyPhrase       36.47      (3.8%)       36.12      
(3.9%)   -0.9% (  -8% -    7%) 0.439
                            LowTerm     1766.82      (3.4%)     1752.12      
(3.3%)   -0.8% (  -7% -    6%) 0.436
                         HighPhrase      263.74      (3.4%)      261.71      
(2.3%)   -0.8% (  -6% -    5%) 0.404
                          OrHighLow      796.71      (2.7%)      790.71      
(2.6%)   -0.8% (  -5% -    4%) 0.367
               BrowseDateSSDVFacets        3.46      (6.4%)        3.44      
(7.1%)   -0.7% ( -13% -   13%) 0.755
                  HighTermMonthSort     3070.86      (4.5%)     3051.39      
(3.9%)   -0.6% (  -8% -    8%) 0.635
                            Prefix3      111.76      (4.6%)      111.08      
(4.2%)   -0.6% (  -8% -    8%) 0.658
                      OrNotHighHigh     1249.27      (3.4%)     1242.30      
(3.8%)   -0.6% (  -7% -    6%) 0.627
              BrowseMonthTaxoFacets       35.43      (1.6%)       35.23      
(2.2%)   -0.6% (  -4% -    3%) 0.367
                    LowSloppyPhrase       61.49      (2.4%)       61.18      
(2.5%)   -0.5% (  -5% -    4%) 0.512
                       OrHighNotMed     1139.05      (3.9%)     1133.29      
(3.6%)   -0.5% (  -7% -    7%) 0.671
        BrowseRandomLabelTaxoFacets       20.33      (4.5%)       20.23      
(5.6%)   -0.5% ( -10% -   10%) 0.760
                           HighTerm     1635.45      (4.2%)     1628.28      
(3.9%)   -0.4% (  -8% -    8%) 0.735
                          MedPhrase       46.41      (2.3%)       46.22      
(1.6%)   -0.4% (  -4% -    3%) 0.529
                          OrHighMed      193.55      (2.8%)      192.79      
(2.9%)   -0.4% (  -5% -    5%) 0.663
                      OrHighNotHigh      865.20      (3.0%)      862.38      
(3.9%)   -0.3% (  -6% -    6%) 0.766
                         AndHighLow     1566.83      (2.7%)     1562.47      
(2.7%)   -0.3% (  -5% -    5%) 0.745
               MedTermDayTaxoFacets       48.00      (3.5%)       47.89      
(3.6%)   -0.2% (  -7% -    7%) 0.836
              HighTermDayOfYearSort      812.55      (2.8%)      811.49      
(2.5%)   -0.1% (  -5% -    5%) 0.878
                            MedTerm     2390.59      (3.5%)     2387.70      
(3.5%)   -0.1% (  -6% -    7%) 0.912
               BrowseDateTaxoFacets       25.05      (9.2%)       25.03      
(9.3%)   -0.1% ( -16% -   20%) 0.984
              BrowseMonthSSDVFacets       16.00     (18.9%)       15.99     
(19.0%)   -0.1% ( -31% -   46%) 0.992
                LowIntervalsOrdered      276.77      (3.5%)      276.66      
(3.6%)   -0.0% (  -6% -    7%) 0.972
                         OrHighHigh       26.41      (4.4%)       26.40      
(4.4%)   -0.0% (  -8% -    9%) 0.978
                        MedSpanNear       35.38      (1.0%)       35.38      
(1.1%)   -0.0% (  -2% -    2%) 0.969
                         TermDTSort      648.08      (1.0%)      648.35      
(1.3%)    0.0% (  -2% -    2%) 0.909
          BrowseDayOfYearTaxoFacets       25.12      (9.5%)       25.13      
(9.5%)    0.0% ( -17% -   21%) 0.988
                       OrNotHighLow     1192.74      (3.1%)     1193.69      
(2.6%)    0.1% (  -5% -    5%) 0.929
               HighIntervalsOrdered        1.79      (3.3%)        1.79      
(3.3%)    0.1% (  -6% -    6%) 0.934
                            Respell       46.93      (1.3%)       46.97      
(1.3%)    0.1% (  -2% -    2%) 0.825
                        LowSpanNear       11.96      (0.8%)       11.97      
(0.9%)    0.1% (  -1% -    1%) 0.702
        BrowseRandomLabelSSDVFacets       10.14      (9.0%)       10.16      
(9.4%)    0.1% ( -16% -   20%) 0.961
                          LowPhrase      438.44      (1.6%)      439.23      
(1.4%)    0.2% (  -2% -    3%) 0.701
           AndHighHighDayTaxoFacets       28.52      (1.9%)       28.57      
(1.8%)    0.2% (  -3% -    3%) 0.746
                MedIntervalsOrdered       64.28      (2.3%)       64.42      
(2.3%)    0.2% (  -4% -    4%) 0.777
                       OrHighNotLow     1854.91      (2.7%)     1858.96      
(3.9%)    0.2% (  -6% -    6%) 0.836
                       HighSpanNear       12.65      (1.4%)       12.69      
(1.7%)    0.3% (  -2% -    3%) 0.579
             OrHighMedDayTaxoFacets       14.34      (3.0%)       14.38      
(4.3%)    0.3% (  -6% -    7%) 0.809
                             Fuzzy1       81.66      (0.9%)       81.90      
(1.1%)    0.3% (  -1% -    2%) 0.357
          BrowseDayOfYearSSDVFacets       13.40     (12.7%)       13.44     
(12.5%)    0.3% ( -22% -   29%) 0.942
            AndHighMedDayTaxoFacets       34.33      (2.1%)       34.45      
(1.7%)    0.3% (  -3% -    4%) 0.569
                             Fuzzy2       76.36      (0.8%)       76.64      
(0.9%)    0.4% (  -1% -    2%) 0.187
               HighTermTitleBDVSort       12.49      (6.6%)       12.55      
(6.6%)    0.4% ( -11% -   14%) 0.846
                           PKLookup      179.34      (2.2%)      180.48      
(2.3%)    0.6% (  -3% -    5%) 0.364
                         AndHighMed      268.67      (3.8%)      270.48      
(3.9%)    0.7% (  -6% -    8%) 0.579
                       OrNotHighMed     1006.62      (4.1%)     1013.59      
(4.3%)    0.7% (  -7% -    9%) 0.604
                           Wildcard      133.41      (2.7%)      134.34      
(2.7%)    0.7% (  -4% -    6%) 0.417
                        AndHighHigh       74.34      (4.7%)       74.95      
(5.2%)    0.8% (  -8% -   11%) 0.608
                             IntNRQ       82.56      (3.5%)       83.70      
(4.9%)    1.4% (  -6% -   10%) 0.306
   ```
   
   Given this, I'd prefer to move forward with the change (since it doesn't 
seem to be doing any harm to common cases covered in `luceneutil` and is 
helping in the filtering use-cases I'm benchmarking), but I'd like your 
opinion. Given how central of a component this is to scoring, I want to make 
sure I'm not being overly biased towards the use-case I'm focused on at the 
cost of regressing elsewhere. Thanks again for your inputs!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #11928: GH#11922: Allow DisjunctionDISIApproximation to short-circuit

Reply via email to