[I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

via GitHub Wed, 18 Oct 2023 10:02:50 -0700


slow-J opened a new issue, #12696:
URL: https://github.com/apache/lucene/issues/12696


   ### Description
   
   Background: In https://github.com/Tony-X/search-benchmark-game we were 
comparing the performance of Tantivy and Lucene. "One difference between Lucene and 
Tantivy is Lucene uses the "patch" FOR, meaning the large values in a block are 
held out as exceptions so that the remaining values can use a smaller number of 
bits to encode, a tradeoff of CPU for lower storage space." In 
https://github.com/Tony-X/search-benchmark-game/issues/46 I disabled 
patching in Lucene to match how Tantivy encodes, then ran the 
search-benchmark-game to test the change.
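
To make the quoted tradeoff concrete, here is a minimal Python sketch that compares the encoded size of one block under plain FOR and under a patched FOR. It is not Lucene's PForUtil code: the function names, the 128-value block, the limit of 7 exceptions, and the flat 16-bit cost per exception are all illustrative assumptions.

```python
def bits_required(value: int) -> int:
    """Number of bits needed to store a non-negative integer (at least 1)."""
    return max(1, value.bit_length())


def for_cost_bits(block: list[int]) -> int:
    """Plain FOR: every value is stored with the width of the largest value."""
    return len(block) * bits_required(max(block))


def pfor_cost_bits(block: list[int], max_exceptions: int = 7,
                   exception_bits: int = 16) -> int:
    """Patched FOR (simplified): hold out up to max_exceptions large values.

    The width is taken from the (max_exceptions + 1)-th largest value, so at
    most max_exceptions values can overflow it. Each overflowing value is
    charged a flat exception_bits (an assumed cost for position + high bits).
    """
    ordered = sorted(block, reverse=True)
    index = min(max_exceptions, len(ordered) - 1)
    width = bits_required(ordered[index])
    exceptions = sum(1 for v in block if bits_required(v) > width)
    return len(block) * width + exceptions * exception_bits


# A block of 128 small values with two outliers (hypothetical data).
block = [3] * 126 + [200, 900]
print(for_cost_bits(block))   # 128 values * 10 bits
print(pfor_cost_bits(block))  # 128 values * 2 bits + 2 exceptions * 16 bits
```

In this toy block the two outliers force plain FOR to spend 10 bits on every value, while the patched variant stores everything in 2 bits and pays separately for the two exceptions. The decoder then has to apply those exceptions after unpacking, which is the extra CPU work the quote refers to.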
   
   Lucene modifications for testing: I cloned PForUtil and removed all 
logic related to patching the exceptions. I then modified Lucene90PostingsReader 
and Lucene90PostingsWriter to use the copy with no patching logic. Sample code: 
https://github.com/slow-J/lucene/commit/83ec5a8b9f7ed39b8aa3ee948ffe5288a9d3fb16
   Hardware used: EC2 Graviton3 instance, m6g.4xlarge
   
   Results from the search-benchmark-game: 
https://github.com/Tony-X/search-benchmark-game/issues/46#issuecomment-1693714327
   We saw Lucene's latency drop by about 2% in all three modes reported: -2% in 
COUNT, -2% in TOP_10_COUNT, and -2.07% in TOP_100.
   
   I then ran a Lucene benchmark with 
[luceneutil](https://github.com/mikemccand/luceneutil) `python3 
src/python/localrun.py -source wikimediumall -r`
   Hardware used: EC2 Graviton3 instance, m6g.4xlarge
   
   Results are posted below:
   ```
                               Task    QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
               BrowseDateSSDVFacets        0.95      (4.5%)        0.95      (5.9%)   -1.0% ( -10% -    9%) 0.566
                  HighTermMonthSort     2558.85      (2.6%)     2539.66      (4.6%)   -0.7% (  -7% -    6%) 0.526
        BrowseRandomLabelSSDVFacets        2.48      (4.3%)        2.47      (1.4%)   -0.7% (  -6% -    5%) 0.486
                            Prefix3      139.77      (1.7%)      139.41      (3.1%)   -0.3% (  -4% -    4%) 0.735
        BrowseRandomLabelTaxoFacets        3.38      (1.5%)        3.37      (1.8%)   -0.2% (  -3% -    3%) 0.689
                             Fuzzy2       50.20      (2.0%)       50.11      (2.1%)   -0.2% (  -4% -    3%) 0.776
                            MedTerm      439.13      (3.9%)      438.94      (4.3%)   -0.0% (  -7% -    8%) 0.974
              BrowseMonthSSDVFacets        3.48      (1.8%)        3.48      (1.5%)    0.0% (  -3% -    3%) 0.949
                            Respell       37.97      (2.9%)       37.99      (2.7%)    0.1% (  -5% -    5%) 0.945
                            LowTerm      323.73      (4.2%)      323.93      (3.9%)    0.1% (  -7% -    8%) 0.960
                             Fuzzy1       45.01      (2.5%)       45.06      (2.6%)    0.1% (  -4% -    5%) 0.896
                           PKLookup      154.24      (2.6%)      154.41      (2.1%)    0.1% (  -4% -    4%) 0.886
                           Wildcard       71.96      (1.4%)       72.04      (1.9%)    0.1% (  -3% -    3%) 0.823
          BrowseDayOfYearSSDVFacets        3.32      (1.8%)        3.33      (1.9%)    0.2% (  -3% -    4%) 0.789
                             IntNRQ       27.77      (8.5%)       27.84      (9.6%)    0.3% ( -16% -   19%) 0.929
                       OrHighNotLow      183.77      (6.9%)      184.66      (5.4%)    0.5% ( -11% -   13%) 0.805
               HighTermTitleBDVSort        4.81      (3.0%)        4.83      (3.0%)    0.5% (  -5% -    6%) 0.566
                           HighTerm      496.37      (5.5%)      499.12      (5.2%)    0.6% (  -9% -   11%) 0.745
                         TermDTSort      109.66      (1.4%)      110.31      (1.3%)    0.6% (  -1% -    3%) 0.155
                   HighSloppyPhrase       23.39      (4.2%)       23.56      (3.8%)    0.7% (  -7% -    9%) 0.580
              BrowseMonthTaxoFacets        3.88      (2.5%)        3.91      (1.7%)    0.8% (  -3% -    5%) 0.261
          BrowseDayOfYearTaxoFacets        3.89      (2.5%)        3.92      (0.9%)    0.8% (  -2% -    4%) 0.171
               BrowseDateTaxoFacets        3.88      (2.4%)        3.91      (0.8%)    0.9% (  -2% -    4%) 0.117
                      OrHighNotHigh      270.96      (5.3%)      273.88      (4.2%)    1.1% (  -7% -   11%) 0.474
                       OrHighNotMed      202.84      (6.3%)      205.09      (4.7%)    1.1% (  -9% -   12%) 0.531
               HighIntervalsOrdered        4.36      (4.2%)        4.41      (4.3%)    1.1% (  -7% -   10%) 0.399
                  HighTermTitleSort       20.12      (5.5%)       20.35      (5.6%)    1.2% (  -9% -   12%) 0.511
               MedTermDayTaxoFacets        9.42      (3.2%)        9.53      (4.3%)    1.2% (  -6% -    8%) 0.325
                         OrHighHigh       12.07      (4.9%)       12.22      (4.9%)    1.3% (  -8% -   11%) 0.415
                       HighSpanNear        9.92      (3.5%)       10.06      (3.3%)    1.3% (  -5% -    8%) 0.212
                      OrNotHighHigh      356.64      (4.5%)      361.84      (3.3%)    1.5% (  -6% -    9%) 0.242
                MedIntervalsOrdered        3.09      (2.9%)        3.14      (3.0%)    1.5% (  -4% -    7%) 0.119
             OrHighMedDayTaxoFacets        1.54      (4.6%)        1.56      (5.3%)    1.5% (  -8% -   11%) 0.338
                       OrNotHighMed      346.18      (3.3%)      351.65      (3.6%)    1.6% (  -5% -    8%) 0.144
                    LowSloppyPhrase       21.58      (2.6%)       21.94      (2.0%)    1.7% (  -2% -    6%) 0.025
                          OrHighMed       48.51      (3.4%)       49.50      (3.5%)    2.0% (  -4% -    9%) 0.062
                        MedSpanNear       26.76      (1.4%)       27.35      (2.2%)    2.2% (  -1% -    5%) 0.000
              HighTermDayOfYearSort      223.99      (2.7%)      228.97      (2.1%)    2.2% (  -2% -    7%) 0.004
                          OrHighLow      208.96      (3.2%)      213.65      (3.0%)    2.2% (  -3% -    8%) 0.022
                        AndHighHigh       13.93      (3.6%)       14.25      (3.7%)    2.4% (  -4% -   10%) 0.042
                    MedSloppyPhrase       19.45      (2.5%)       19.92      (2.0%)    2.4% (  -1% -    7%) 0.001
           AndHighHighDayTaxoFacets        2.12      (3.9%)        2.17      (5.2%)    2.4% (  -6% -   11%) 0.094
                        LowSpanNear        8.99      (1.6%)        9.24      (1.4%)    2.7% (   0% -    5%) 0.000
                LowIntervalsOrdered        7.46      (2.6%)        7.67      (3.1%)    2.8% (  -2% -    8%) 0.002
                         HighPhrase       12.54      (2.8%)       12.91      (3.3%)    2.9% (  -3% -    9%) 0.003
                          MedPhrase       25.45      (1.8%)       26.22      (2.1%)    3.0% (   0% -    6%) 0.000
            AndHighMedDayTaxoFacets       16.49      (1.7%)       17.08      (2.3%)    3.6% (   0% -    7%) 0.000
                         AndHighMed       91.80      (2.2%)       95.26      (2.3%)    3.8% (   0% -    8%) 0.000
                          LowPhrase      144.04      (1.5%)      150.03      (2.0%)    4.2% (   0% -    7%) 0.000
                       OrNotHighLow      315.83      (1.8%)      330.04      (2.6%)    4.5% (   0% -    9%) 0.000
                         AndHighLow      352.57      (2.7%)      379.44      (3.7%)    7.6% (   1% -   14%) 0.000
   ```
   The tasks at the bottom of the table had the largest QPS improvements.
   AndHighLow improved by +7.6%!
   
   Size of test candidate index:   18.401 GiB   total
   Size of test baseline index:   17.626 GiB    total
   
   This change would bring an increase of about 4.4% in index size.
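
The percentage above follows directly from the two index sizes, taking the smaller index as the reference. A quick check in Python (the helper name is just for illustration):

```python
def percent_increase(smaller_gib: float, larger_gib: float) -> float:
    """Relative growth from the smaller index to the larger one, in percent."""
    return (larger_gib - smaller_gib) / smaller_gib * 100.0


# Index sizes reported in this issue, in GiB.
increase = percent_increase(17.626, 18.401)
print(f"{increase:.2f}%")  # 4.40%
```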
   
   ### Proposal
   I propose adding an option to Lucene's codec that lets users disable 
patching in the PFOR encoding, trading a larger index for the performance 
benefits observed here.
   I would appreciate feedback and further evaluation of this idea from the 
Lucene community.
   
   



