[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

via GitHub Fri, 17 Feb 2023 14:11:39 -0800


gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435332248


   OK, I think I've addressed the previous feedback and also brought in the 
same changes to `TermInSetQuery`. This should be ready for feedback @jpountz 
(whenever you have a free moment).
   
   On some internal benchmarks (Amazon product search), we see throughput 
increases of roughly 7% to 63%, depending on a number of factors. In some 
cases, pre-processing long postings in the existing `TermInSetQuery` (TiS) 
implementation takes up a large share of CPU time. These tend to be cases 
where the actual matches are sparse but the TiS terms match a large number of 
docs. Being able to "hold back" these long postings in a 
`DisjunctionDISIApproximation` while still pre-processing the shorter postings 
is a big win in these cases. Here are some flame charts showing the impact 
(heavily redacted, of course):
   
   "Normal" TiS:
   <img width="1469" alt="Screen Shot 2023-02-15 at 7 51 01 AM" 
src="https://user-images.githubusercontent.com/16479560/219803198-c28ce8f5-1807-4c86-b3ec-7d6e33096533.png">
   
   TiS with this PR:
   <img width="1415" alt="Screen Shot 2023-02-15 at 10 38 38 AM" 
src="https://user-images.githubusercontent.com/16479560/219803253-b785d652-cf37-42c6-bf7c-be6d18e69caf.png">
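
To make the "hold back" idea concrete, here is a minimal, self-contained Java sketch. It is a toy model and not the code in this PR: postings are plain sorted `int[]` arrays, and the names `HoldBackSketch`, `Cursor`, `longThreshold`, and `intersect` are all hypothetical. Short postings are merged eagerly into a bit set, while long postings are only advanced lazily to the doc IDs that a sparse lead clause actually proposes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Toy model of holding back long postings; this is not Lucene's API. */
final class HoldBackSketch {

    /** Lazy cursor over one long, sorted postings list. */
    static final class Cursor {
        private final int[] docs;
        private int pos = 0;

        Cursor(int[] docs) {
            this.docs = docs;
        }

        /** Skips to the first doc >= target; returns Integer.MAX_VALUE when exhausted. */
        int advance(int target) {
            int i = Arrays.binarySearch(docs, pos, docs.length, target);
            pos = i >= 0 ? i : -i - 1; // a negative result encodes the insertion point
            return pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
        }
    }

    final BitSet shortDocs = new BitSet();          // eagerly pre-processed postings
    final List<Cursor> heldBack = new ArrayList<>(); // long postings, consulted on demand

    HoldBackSketch(List<int[]> postings, int longThreshold) {
        for (int[] p : postings) {
            if (p.length > longThreshold) {
                heldBack.add(new Cursor(p)); // no up-front iteration over the long list
            } else {
                for (int doc : p) {
                    shortDocs.set(doc);
                }
            }
        }
    }

    /** Must be called with non-decreasing doc IDs, since cursors only move forward. */
    boolean matches(int doc) {
        if (shortDocs.get(doc)) {
            return true;
        }
        for (Cursor c : heldBack) {
            if (c.advance(doc) == doc) {
                return true;
            }
        }
        return false;
    }

    /** Intersects sorted candidates from a sparse lead clause with the term disjunction. */
    static List<Integer> intersect(int[] candidates, List<int[]> postings, int longThreshold) {
        HoldBackSketch sketch = new HoldBackSketch(postings, longThreshold);
        List<Integer> hits = new ArrayList<>();
        for (int doc : candidates) {
            if (sketch.matches(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

In this model, the up-front cost is proportional to the total size of the short postings only. The long postings cost a skip per candidate, which is cheap when the lead clause is sparse.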
   
   I also re-ran the `luceneutil` benchmarks (`wikimedium10m`) and see results 
consistent with the initial PR:
   ```
                          Task  QPS baseline    StdDev  QPS candidate    StdDev                   Pct diff  p-value
         BrowseMonthTaxoFacets         32.05   (14.3%)          30.06   (23.2%)     -6.2%  ( -38% -   36%)  0.309
         BrowseMonthSSDVFacets         14.85   (17.6%)          13.95    (2.3%)     -6.0%  ( -22% -   16%)  0.127
   BrowseRandomLabelTaxoFacets         23.18   (12.4%)          21.93   (19.8%)     -5.4%  ( -33% -   30%)  0.303
          BrowseDateTaxoFacets         31.04   (14.2%)          29.42   (22.7%)     -5.2%  ( -36% -   37%)  0.383
     BrowseDayOfYearTaxoFacets         31.29   (14.4%)          29.66   (22.9%)     -5.2%  ( -37% -   37%)  0.389
     BrowseDayOfYearSSDVFacets         14.17   (17.7%)          13.85   (14.0%)     -2.2%  ( -28% -   35%)  0.659
                        IntNRQ        156.59    (5.3%)         154.50    (7.1%)     -1.3%  ( -12% -   11%)  0.498
             HighTermTitleSort        100.27    (2.6%)          99.43    (2.8%)     -0.8%  (  -6% -    4%)  0.329
             HighTermMonthSort       2633.69    (3.3%)        2614.59    (3.1%)     -0.7%  (  -6% -    5%)  0.477
                    AndHighLow       1085.57    (2.9%)        1080.12    (2.5%)     -0.5%  (  -5% -    5%)  0.559
                 OrNotHighHigh        795.28    (3.1%)         791.61    (3.4%)     -0.5%  (  -6% -    6%)  0.656
                    HighPhrase        145.14    (2.7%)         144.70    (2.8%)     -0.3%  (  -5% -    5%)  0.725
          BrowseDateSSDVFacets          3.81    (7.7%)           3.80    (7.9%)     -0.3%  ( -14% -   16%)  0.905
                        Fuzzy2         57.08    (1.3%)          56.95    (1.2%)     -0.2%  (  -2% -    2%)  0.555
                     LowPhrase        379.61    (2.4%)         378.74    (2.8%)     -0.2%  (  -5% -    5%)  0.783
                        Fuzzy1         76.62    (1.5%)          76.45    (1.2%)     -0.2%  (  -2% -    2%)  0.621
                     MedPhrase         19.20    (2.3%)          19.19    (2.2%)     -0.1%  (  -4% -    4%)  0.898
        OrHighMedDayTaxoFacets         15.40    (3.4%)          15.40    (3.3%)     -0.0%  (  -6% -    6%)  0.978
                  OrNotHighLow       1186.45    (3.0%)        1186.35    (2.2%)     -0.0%  (  -5% -    5%)  0.992
                   MedSpanNear          8.28    (2.7%)           8.28    (2.6%)      0.0%  (  -5% -    5%)  0.996
                    TermDTSort        104.78    (1.2%)         104.79    (1.3%)      0.0%  (  -2% -    2%)  0.984
                    AndHighMed        204.19    (3.2%)         204.21    (3.7%)      0.0%  (  -6% -    7%)  0.992
          MedTermDayTaxoFacets         51.30    (2.9%)          51.32    (3.0%)      0.0%  (  -5% -    6%)  0.970
          HighTermTitleBDVSort         21.28    (4.7%)          21.28    (4.2%)      0.0%  (  -8% -    9%)  0.975
                   LowSpanNear        179.68    (1.3%)         179.77    (1.6%)      0.0%  (  -2% -    3%)  0.919
       AndHighMedDayTaxoFacets         25.84    (1.6%)          25.85    (1.7%)      0.0%  (  -3% -    3%)  0.923
                     OrHighMed        106.73    (3.0%)         106.83    (3.4%)      0.1%  (  -6% -    6%)  0.932
                  OrNotHighMed        329.39    (3.1%)         329.71    (3.4%)      0.1%  (  -6% -    6%)  0.925
                       Respell         49.02    (0.9%)          49.09    (0.6%)      0.1%  (  -1% -    1%)  0.546
           MedIntervalsOrdered          4.40    (5.8%)           4.40    (5.5%)      0.1%  ( -10% -   12%)  0.933
      AndHighHighDayTaxoFacets         13.00    (1.7%)          13.02    (1.5%)      0.2%  (  -2% -    3%)  0.760
                      PKLookup        181.72    (2.5%)         182.00    (2.2%)      0.2%  (  -4% -    4%)  0.833
                  HighSpanNear         23.16    (1.5%)          23.20    (1.9%)      0.2%  (  -3% -    3%)  0.763
                  OrHighNotLow        420.49    (3.3%)         421.43    (5.0%)      0.2%  (  -7% -    8%)  0.867
           LowIntervalsOrdered         32.01    (3.6%)          32.08    (3.6%)      0.2%  (  -6% -    7%)  0.838
               MedSloppyPhrase        136.79    (2.9%)         137.27    (2.7%)      0.3%  (  -5% -    6%)  0.696
         HighTermDayOfYearSort        233.88    (2.3%)         234.72    (2.9%)      0.4%  (  -4% -    5%)  0.663
               LowSloppyPhrase         24.70    (2.6%)          24.80    (2.5%)      0.4%  (  -4% -    5%)  0.604
                  OrHighNotMed        477.53    (3.0%)         479.58    (5.0%)      0.4%  (  -7% -    8%)  0.742
                 OrHighNotHigh        274.34    (3.4%)         275.55    (4.8%)      0.4%  (  -7% -    8%)  0.737
          HighIntervalsOrdered          1.86    (3.4%)           1.87    (3.3%)      0.5%  (  -6% -    7%)  0.652
   BrowseRandomLabelSSDVFacets         10.16    (8.9%)          10.22    (9.0%)      0.5%  ( -15% -   20%)  0.850
                     OrHighLow        426.87    (2.9%)         429.35    (4.0%)      0.6%  (  -6% -    7%)  0.599
                       MedTerm        520.48    (3.7%)         524.12    (5.6%)      0.7%  (  -8% -   10%)  0.643
                      HighTerm        522.07    (3.3%)         526.23    (5.2%)      0.8%  (  -7% -    9%)  0.562
              HighSloppyPhrase          9.98    (4.5%)          10.06    (4.2%)      0.8%  (  -7% -    9%)  0.562
                    OrHighHigh         29.97    (4.5%)          30.24    (5.5%)      0.9%  (  -8% -   11%)  0.570
                       LowTerm        732.69    (3.2%)         740.93    (4.5%)      1.1%  (  -6% -    9%)  0.362
                   AndHighHigh         29.62    (5.1%)          29.99    (5.9%)      1.2%  (  -9% -   12%)  0.481
                       Prefix3        138.32    (1.3%)         178.19    (1.9%)     28.8%  (  25% -   32%)  0.000
                      Wildcard        393.93    (1.5%)         759.76    (3.8%)     92.9%  (  86% -   99%)  0.000
   ```
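
As a reading aid for the table: the `Pct diff` column appears to be the relative change in mean QPS from baseline to candidate. That reading is an assumption, checked here only against the `Prefix3` and `Wildcard` rows. The class and method names below (`PctDiff`, `pctDiff`) are hypothetical and not part of `luceneutil`.

```java
/** Hypothetical helper; assumes Pct diff = 100 * (candidate - baseline) / baseline. */
final class PctDiff {

    static double pctDiff(double baselineQps, double candidateQps) {
        return 100.0 * (candidateQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // QPS values copied from the Prefix3 and Wildcard rows of the table.
        System.out.printf("Prefix3:  %.1f%%%n", pctDiff(138.32, 178.19));
        System.out.printf("Wildcard: %.1f%%%n", pctDiff(393.93, 759.76));
    }
}
```

Under that assumption, both rows round to the values reported in the table (28.8% and 92.9%).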


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

