[jira] (LUCENE-10334) Introduce a BlockReader based on ForUtil and use it for NumericDocValues

Feng Guo (Jira) Wed, 29 Dec 2021 23:30:07 -0800


    [ https://issues.apache.org/jira/browse/LUCENE-10334 ]



    Feng Guo deleted comment on LUCENE-10334:
    -----------------------------------

was (Author: gf2121):
If we cannot tolerate the regression, another idea that comes to mind is to 
introduce a 'warm-up detection' phase in {{DirectReader}}. Since most uses of 
DirectReader in DocValuesProducer are forward reads, we can probably judge 
whether the hits are dense or sparse from the first 128 calls to #get. For 
example, we can assume the reading is dense if #get is called for more than 
80% of the first block, and switch to block decoding for the following gets.
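
As a rough illustration of the heuristic described above (this is not the code 
from the POC; the class, enum, and method names are hypothetical and not part 
of Lucene, and "the first block" is read here as indices 0 to 127 under strictly 
forward reading), the decision step could be sketched like this:

```java
/**
 * Hypothetical sketch of the warm-up detection idea; not Lucene code.
 * Assumes get() is called with strictly increasing indices (forward reading).
 */
final class WarmUpDetector {
    enum Mode { WARMING_UP, DENSE, SPARSE }

    static final int BLOCK_SIZE = 128;   // size of the first block
    static final int DENSE_PERCENT = 80; // "more than 80%" threshold

    private Mode mode = Mode.WARMING_UP;
    private int getsInFirstBlock = 0;

    /** Records one get(index) and returns the decoding mode to use for it. */
    Mode onGet(long index) {
        if (mode != Mode.WARMING_UP) {
            return mode; // the decision is made once and then kept
        }
        if (index < BLOCK_SIZE) {
            getsInFirstBlock++; // still inside the first block: keep counting
            return mode;
        }
        // First get past the first block: decide dense vs. sparse.
        boolean dense = getsInFirstBlock * 100 > BLOCK_SIZE * DENSE_PERCENT;
        mode = dense ? Mode.DENSE : Mode.SPARSE;
        return mode;
    }

    public static void main(String[] args) {
        WarmUpDetector detector = new WarmUpDetector();
        for (long i = 0; i < BLOCK_SIZE; i++) {
            detector.onGet(i); // every value in the first block is read
        }
        System.out.println(detector.onGet(BLOCK_SIZE));
    }
}
```

Under these assumptions, a reader would keep per-value decoding while the mode 
is WARMING_UP or SPARSE and switch to decoding whole blocks once it is DENSE.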

Here is the POC code: [https://github.com/apache/lucene/pull/570], and here are 
the benchmark results:
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
          OrHighMedDayTaxoFacets       12.08      (5.6%)       11.85      (4.4%)   -1.9% ( -11% -    8%) 0.228
            MedTermDayTaxoFacets       35.50      (2.9%)       35.09      (2.1%)   -1.2% (  -5% -    3%) 0.148
        AndHighHighDayTaxoFacets       20.35      (2.5%)       20.18      (2.2%)   -0.8% (  -5% -    4%) 0.275
           BrowseMonthTaxoFacets       14.09     (12.4%)       13.99      (7.2%)   -0.7% ( -18% -   21%) 0.817
         AndHighMedDayTaxoFacets      100.43      (2.2%)       99.96      (2.2%)   -0.5% (  -4% -    3%) 0.501
             LowIntervalsOrdered       31.96      (3.6%)       31.90      (2.7%)   -0.2% (  -6% -    6%) 0.853
            HighIntervalsOrdered        9.82      (4.8%)        9.81      (3.8%)   -0.1% (  -8% -    8%) 0.925
           HighTermDayOfYearSort       58.36      (8.2%)       58.29      (7.2%)   -0.1% ( -14% -   16%) 0.962
             MedIntervalsOrdered       16.33      (3.3%)       16.33      (2.5%)   -0.0% (  -5% -    6%) 0.967
            HighTermTitleBDVSort       82.38     (11.9%)       82.52     (13.2%)    0.2% ( -22% -   28%) 0.966
                    HighSpanNear       38.08      (1.9%)       38.17      (1.5%)    0.2% (  -3% -    3%) 0.687
                     AndHighHigh       73.02      (4.1%)       73.20      (4.4%)    0.2% (  -7% -    9%) 0.854
                      OrHighHigh       38.67      (2.1%)       38.77      (1.9%)    0.3% (  -3% -    4%) 0.669
                 LowSloppyPhrase       48.05      (5.4%)       48.20      (5.5%)    0.3% ( -10% -   11%) 0.856
                 MedSloppyPhrase       34.55      (2.7%)       34.66      (2.6%)    0.3% (  -4% -    5%) 0.696
                      TermDTSort      200.08     (11.2%)      200.74     (11.3%)    0.3% ( -19% -   25%) 0.926
               HighTermMonthSort      126.69     (11.4%)      127.18     (11.7%)    0.4% ( -20% -   26%) 0.917
                HighSloppyPhrase       14.03      (3.5%)       14.09      (3.7%)    0.4% (  -6% -    7%) 0.703
                     MedSpanNear      103.61      (2.1%)      104.14      (1.2%)    0.5% (  -2% -    3%) 0.332
                          IntNRQ      126.16      (2.3%)      126.81      (2.7%)    0.5% (  -4% -    5%) 0.508
                      AndHighMed      164.27      (4.2%)      165.20      (4.4%)    0.6% (  -7% -    9%) 0.676
                     LowSpanNear      167.58      (2.7%)      168.63      (2.6%)    0.6% (  -4% -    6%) 0.460
                        PKLookup      201.62      (3.8%)      203.05      (4.7%)    0.7% (  -7% -    9%) 0.599
                         Respell       73.56      (2.1%)       74.43      (2.7%)    1.2% (  -3% -    6%) 0.121
                       MedPhrase      266.51      (5.2%)      270.42      (5.9%)    1.5% (  -9% -   13%) 0.405
                       OrHighMed      116.57      (4.0%)      118.30      (3.3%)    1.5% (  -5% -    9%) 0.202
                         Prefix3      136.44      (3.9%)      138.51      (3.6%)    1.5% (  -5% -    9%) 0.204
                    OrNotHighMed      669.05      (5.3%)      679.79      (7.7%)    1.6% ( -10% -   15%) 0.443
                    OrNotHighLow      907.93      (5.8%)      922.66     (10.1%)    1.6% ( -13% -   18%) 0.533
                        Wildcard      146.59      (3.2%)      149.19      (4.9%)    1.8% (  -6% -   10%) 0.172
                       OrHighLow      383.74      (8.5%)      390.67      (8.0%)    1.8% ( -13% -   20%) 0.489
                      HighPhrase       96.06      (4.4%)       97.81      (6.8%)    1.8% (  -8% -   13%) 0.316
                          Fuzzy2       65.58     (12.9%)       66.81     (11.3%)    1.9% ( -19% -   29%) 0.624
                       LowPhrase      145.74      (4.0%)      148.50      (5.1%)    1.9% (  -6% -   11%) 0.192
                         MedTerm     1470.64      (7.1%)     1498.96      (9.5%)    1.9% ( -13% -   19%) 0.468
                   OrHighNotHigh      562.56      (5.7%)      573.78      (7.3%)    2.0% ( -10% -   15%) 0.336
                          Fuzzy1       95.47      (5.7%)       97.51      (7.3%)    2.1% ( -10% -   16%) 0.303
                    OrHighNotMed      680.95      (6.1%)      696.12      (9.7%)    2.2% ( -12% -   19%) 0.384
                        HighTerm     1121.76      (5.6%)     1149.67      (8.4%)    2.5% ( -10% -   17%) 0.270
                    OrHighNotLow      913.24      (6.9%)      939.67     (12.4%)    2.9% ( -15% -   23%) 0.362
                      AndHighLow      681.76      (6.2%)      702.39      (9.2%)    3.0% ( -11% -   19%) 0.224
                         LowTerm     1340.75      (7.4%)     1384.34      (8.9%)    3.3% ( -12% -   21%) 0.210
                   OrNotHighHigh      568.63      (4.9%)      587.28      (8.7%)    3.3% (  -9% -   17%) 0.142
            BrowseDateTaxoFacets       13.40      (9.6%)       14.66     (19.7%)    9.4% ( -18% -   42%) 0.055
     BrowseRandomLabelTaxoFacets       11.75      (7.8%)       12.86     (16.4%)    9.5% ( -13% -   36%) 0.020
       BrowseDayOfYearTaxoFacets       13.49      (9.9%)       14.81     (20.0%)    9.8% ( -18% -   44%) 0.051
           BrowseMonthSSDVFacets       15.72      (0.8%)       18.03      (4.6%)   14.7% (   9% -   20%) 0.000
     BrowseRandomLabelSSDVFacets       10.39      (1.4%)       11.93      (3.5%)   14.8% (   9% -   19%) 0.000
       BrowseDayOfYearSSDVFacets       14.29      (0.9%)       17.75      (4.8%)   24.2% (  18% -   30%) 0.000
{code}
*Advantages*
1. No need to change file format.
2. This is nearly a net win, with no case regressing (at least according to 
luceneutil :) )

*Disadvantages*
1. We cannot benefit from the SIMD optimizations in {{ForUtil}}.

> Introduce a BlockReader based on ForUtil and use it for NumericDocValues
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-10334
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10334
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Feng Guo
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Previous discussion is here: [https://github.com/apache/lucene/pull/557]
> This tries to add a new BlockReader based on ForUtil to replace the 
> DirectReader we are using for NumericDocValues.
> -*Benchmark based on wiki10m*- (The previous benchmark results were wrong, so 
> I deleted them to avoid misleading readers; see the benchmarks in the comments.)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

