Re: [PR] Random access term dictionary [lucene]

via GitHub Thu, 23 Nov 2023 23:05:09 -0800


Tony-X commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1825228925


   After some tweaking and tinkering I was finally able to get the bench 
running the way I wanted in luceneutil! @mikemccand 
   
   Unfortunately luceneutil out of the box doesn't work for my case...
   
   
   ```
                               Task QPS baseline      StdDev QPS my_modified_version      StdDev                Pct diff p-value
                           Wildcard       56.20      (2.0%)        7.30      (0.2%)  -87.0% ( -87% -  -86%) 0.000
                            Respell       44.00      (1.7%)       14.46      (0.6%)  -67.1% ( -68% -  -65%) 0.000
                             Fuzzy1       54.11      (1.2%)       20.86      (0.8%)  -61.4% ( -62% -  -60%) 0.000
                            Prefix3      103.61      (0.8%)       41.37      (0.5%)  -60.1% ( -60% -  -59%) 0.000
                             Fuzzy2       42.43      (1.2%)       20.26      (0.9%)  -52.3% ( -53% -  -50%) 0.000
                  HighTermTitleSort      127.60      (1.7%)      114.65      (1.6%)  -10.1% ( -13% -   -6%) 0.000
                  HighTermMonthSort     2549.22      (2.8%)     2435.74      (3.8%)   -4.5% ( -10% -    2%) 0.037
                         AndHighLow      708.99      (2.3%)      678.15      (2.1%)   -4.4% (  -8% -    0%) 0.001
                            LowTerm      369.16      (5.5%)      358.15      (3.1%)   -3.0% ( -10% -    5%) 0.287
                       OrNotHighLow      258.51      (1.8%)      252.11      (2.2%)   -2.5% (  -6% -    1%) 0.054
                          OrHighLow      348.58      (1.1%)      340.11      (2.5%)   -2.4% (  -5% -    1%) 0.046
                      OrNotHighHigh      141.31      (6.7%)      138.54      (3.2%)   -2.0% ( -11% -    8%) 0.554
                             IntNRQ       17.93      (1.8%)       17.62      (3.7%)   -1.7% (  -7% -    3%) 0.349
                    MedSloppyPhrase       26.73      (0.8%)       26.28      (1.3%)   -1.7% (  -3% -    0%) 0.017
               HighIntervalsOrdered        3.88      (2.6%)        3.82      (2.3%)   -1.6% (  -6% -    3%) 0.314
                           HighTerm      290.77      (7.3%)      286.20      (6.9%)   -1.6% ( -14% -   13%) 0.727
                      OrHighNotHigh      296.51      (6.9%)      291.90      (4.9%)   -1.6% ( -12% -   11%) 0.682
                LowIntervalsOrdered       15.48      (1.1%)       15.26      (1.3%)   -1.4% (  -3% -    0%) 0.057
                            MedTerm      405.54      (6.3%)      399.77      (5.9%)   -1.4% ( -12% -   11%) 0.713
                    LowSloppyPhrase       44.64      (0.5%)       44.04      (2.1%)   -1.3% (  -3% -    1%) 0.171
              HighTermDayOfYearSort      177.36      (1.6%)      175.08      (1.8%)   -1.3% (  -4% -    2%) 0.234
                          OrHighMed       79.44      (3.7%)       78.48      (3.0%)   -1.2% (  -7% -    5%) 0.572
                       OrNotHighMed      268.82      (4.3%)      265.74      (3.1%)   -1.1% (  -8% -    6%) 0.632
              BrowseMonthTaxoFacets        3.89      (0.4%)        3.85      (1.2%)   -0.9% (  -2% -    0%) 0.134
            AndHighMedDayTaxoFacets       44.33      (0.7%)       43.96      (1.1%)   -0.8% (  -2% -    0%) 0.145
                        MedSpanNear       30.67      (1.1%)       30.42      (2.3%)   -0.8% (  -4% -    2%) 0.481
                        LowSpanNear        4.56      (0.8%)        4.53      (2.1%)   -0.6% (  -3% -    2%) 0.576
                       HighSpanNear        8.52      (1.4%)        8.47      (2.1%)   -0.5% (  -3% -    2%) 0.636
                       OrHighNotLow      236.32      (6.6%)      235.28      (4.4%)   -0.4% ( -10% -   11%) 0.901
           AndHighHighDayTaxoFacets        4.31      (0.6%)        4.29      (0.8%)   -0.4% (  -1% -    1%) 0.446
                          LowPhrase       69.90      (1.2%)       69.67      (2.7%)   -0.3% (  -4% -    3%) 0.807
                         AndHighMed       41.87      (0.6%)       41.75      (2.7%)   -0.3% (  -3% -    3%) 0.828
                         OrHighHigh       45.86      (7.4%)       45.77      (7.9%)   -0.2% ( -14% -   16%) 0.969
                         TermDTSort      101.90      (2.3%)      101.84      (2.0%)   -0.1% (  -4% -    4%) 0.966
                   HighSloppyPhrase        0.48      (2.2%)        0.48      (2.6%)   -0.0% (  -4% -    4%) 0.995
               BrowseDateSSDVFacets        1.00      (3.7%)        1.00      (5.3%)   -0.0% (  -8% -    9%) 1.000
               HighTermTitleBDVSort        4.48      (5.4%)        4.49      (4.4%)    0.0% (  -9% -   10%) 0.990
          BrowseDayOfYearSSDVFacets        3.51     (13.4%)        3.52     (13.3%)    0.1% ( -23% -   30%) 0.990
              BrowseMonthSSDVFacets        3.38      (0.7%)        3.39      (0.6%)    0.1% (  -1% -    1%) 0.740
          BrowseDayOfYearTaxoFacets        3.85      (0.3%)        3.85      (0.5%)    0.2% (   0% -    0%) 0.469
                MedIntervalsOrdered        1.65      (1.5%)        1.65      (1.5%)    0.2% (  -2% -    3%) 0.802
               BrowseDateTaxoFacets        3.82      (0.3%)        3.83      (0.4%)    0.3% (   0% -    0%) 0.264
        BrowseRandomLabelSSDVFacets        2.34      (1.0%)        2.35      (0.5%)    0.3% (  -1% -    1%) 0.494
                       OrHighNotMed      296.34      (7.6%)      297.38      (4.6%)    0.4% ( -11% -   13%) 0.930
        BrowseRandomLabelTaxoFacets        3.29      (0.6%)        3.31      (0.7%)    0.5% (   0% -    1%) 0.275
                          MedPhrase        9.84      (2.1%)        9.90      (4.1%)    0.6% (  -5% -    6%) 0.787
                        AndHighHigh       33.50      (1.2%)       33.73      (3.4%)    0.7% (  -3% -    5%) 0.669
                         HighPhrase       15.15      (2.3%)       15.28      (4.1%)    0.9% (  -5% -    7%) 0.679
               MedTermDayTaxoFacets       13.36      (1.8%)       13.55      (2.0%)    1.4% (  -2% -    5%) 0.240
             OrHighMedDayTaxoFacets        3.73      (1.4%)        3.83      (2.2%)    2.6% (   0% -    6%) 0.026
                           PKLookup      147.84      (1.6%)      157.37      (1.5%)    6.4% (   3% -    9%) 0.000
   ```
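
For anyone reading the table, the `Pct diff` column is consistent with the relative change between the two mean QPS values. The sketch below recomputes it for three rows copied from the table. The helper name `pct_diff` is made up for illustration and is not taken from luceneutil. The confidence range in parentheses is not reproduced here.

```python
def pct_diff(baseline_qps: float, candidate_qps: float) -> float:
    """Relative change of the candidate versus the baseline, in percent."""
    return 100.0 * (candidate_qps - baseline_qps) / baseline_qps


# Mean QPS pairs copied from the table: (baseline, my_modified_version).
rows = {
    "Wildcard": (56.20, 7.30),
    "Prefix3": (103.61, 41.37),
    "PKLookup": (147.84, 157.37),
}

for task, (base, cand) in rows.items():
    # Negative values mean the modified version is slower than the baseline.
    print(f"{task:>10} {pct_diff(base, cand):+6.1f}%")
```

A negative value means lower throughput for `my_modified_version`, so only `PKLookup` among these three rows is a win.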
   
   ### Observations:
   #### PKLookup improved
   This is expected, since the terms index (FST) holds all the terms.
   
   #### Fuzzy/Wildcard/Prefix queries got *much slower* 
   This is also expected: I am currently using the default implementation 
provided by `TermsEnum`, which does not take advantage of the FST. With an 
optimized implementation I expect these queries to be at least on par, and 
possibly slightly faster, because the FST holds information about all terms, 
whereas the current BlockTreeTerms only holds prefixes. 
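
As a toy illustration of why an enumeration that ignores the index can be much slower than one that seeks with it, the sketch below answers a prefix query two ways over a sorted list of terms. This is not Lucene's `TermsEnum` or FST code: the sorted Python list stands in for the term dictionary, both function names are hypothetical, and whether the default path in this PR behaves like the first function is an assumption.

```python
from bisect import bisect_left


def prefix_by_full_scan(sorted_terms, prefix):
    """Visit every term and test it; cost grows with the dictionary size."""
    return [t for t in sorted_terms if t.startswith(prefix)]


def prefix_by_seek(sorted_terms, prefix):
    """Seek to the first term >= prefix, then stop at the first non-match."""
    i = bisect_left(sorted_terms, prefix)
    out = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        out.append(sorted_terms[i])
        i += 1
    return out


terms = sorted(["lucene", "lucid", "luck", "lunar", "term", "terms", "tree"])

# Both strategies return the same terms; only the amount of work differs.
assert prefix_by_full_scan(terms, "luc") == prefix_by_seek(terms, "luc")
print(prefix_by_seek(terms, "luc"))
```

The seek-based version touches only the matching terms plus one extra, which is the kind of saving an index-aware implementation could plausibly recover.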
   
   #### `HighTermTitleSort` and `HighTermMonthSort` lost roughly 4.5% to 10% throughput
   I don't quite understand why term lookup could
   
   
   #### `AndHighLow` got slower
   Am I missing some optimization opportunity for low-frequency terms?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


