[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

Adrien Grand (Jira) Tue, 27 Jul 2021 09:49:12 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388178#comment-17388178
 ]


Adrien Grand commented on LUCENE-10033:
---------------------------------------

I opened a PR with this idea. Queries that consume most values, like the 
Browse* faceting tasks, become faster, but queries that only consume a small 
subset of values, like some sorting tasks, become slower (not all of them: one 
sorting task is faster).

{noformat}
                     Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff p-value
        HighTermMonthSort       101.33      (9.7%)       51.93      (2.8%)  -48.7% ( -55% -  -40%) 0.000
               TermDTSort       587.24      (6.1%)      404.20      (2.9%)  -31.2% ( -37% -  -23%) 0.000
                   IntNRQ        85.55     (14.7%)       73.16      (1.6%)  -14.5% ( -26% -    2%) 0.000
             OrHighNotMed      1301.37      (3.7%)     1218.64      (2.3%)   -6.4% ( -11% -    0%) 0.000
            OrNotHighHigh      1121.91      (4.1%)     1089.27      (2.7%)   -2.9% (  -9% -    4%) 0.008
                  MedTerm      2156.71      (3.3%)     2103.32      (3.6%)   -2.5% (  -9% -    4%) 0.022
                   Fuzzy2        67.41      (4.6%)       65.74      (4.9%)   -2.5% ( -11% -    7%) 0.098
             OrNotHighLow      1099.66      (3.7%)     1078.60      (3.0%)   -1.9% (  -8% -    4%) 0.073
      MedIntervalsOrdered        79.39      (3.0%)       77.94      (3.7%)   -1.8% (  -8% -    5%) 0.088
                MedPhrase       403.62      (2.8%)      397.19      (2.3%)   -1.6% (  -6% -    3%) 0.050
                OrHighMed       130.57      (3.0%)      128.64      (2.6%)   -1.5% (  -6% -    4%) 0.099
      LowIntervalsOrdered        20.82      (2.5%)       20.55      (3.4%)   -1.3% (  -6% -    4%) 0.167
     HighIntervalsOrdered         2.95      (5.1%)        2.91      (5.8%)   -1.1% ( -11% -   10%) 0.530
                OrHighLow       579.45      (2.9%)      574.45      (2.4%)   -0.9% (  -5% -    4%) 0.306
              LowSpanNear        33.20      (2.9%)       33.06      (3.5%)   -0.4% (  -6% -    6%) 0.668
             HighSpanNear         9.79      (3.5%)        9.79      (3.7%)   -0.0% (  -7% -    7%) 0.996
                  Respell       221.47      (2.1%)      221.62      (2.8%)    0.1% (  -4% -    4%) 0.931
         HighSloppyPhrase        36.64      (3.4%)       36.69      (4.0%)    0.1% (  -7% -    7%) 0.915
                 Wildcard       283.85      (6.5%)      285.06      (7.2%)    0.4% ( -12% -   15%) 0.845
          LowSloppyPhrase       175.77      (4.3%)      176.56      (4.4%)    0.5% (  -7% -    9%) 0.740
              AndHighHigh        64.34      (2.5%)       64.84      (3.4%)    0.8% (  -5% -    6%) 0.410
                 HighTerm      2146.56      (3.3%)     2164.26      (4.5%)    0.8% (  -6% -    8%) 0.505
     HighTermTitleBDVSort        27.18      (4.6%)       27.41      (2.1%)    0.8% (  -5% -    7%) 0.461
             OrHighNotLow      1261.38      (2.3%)     1274.89      (3.0%)    1.1% (  -4% -    6%) 0.210
              MedSpanNear        26.96      (4.1%)       27.28      (3.5%)    1.2% (  -6% -    9%) 0.336
          MedSloppyPhrase       102.18      (4.7%)      103.51      (5.1%)    1.3% (  -8% -   11%) 0.399
     BrowseDateTaxoFacets         3.15      (4.0%)        3.19      (4.0%)    1.4% (  -6% -    9%) 0.281
BrowseDayOfYearTaxoFacets         3.15      (4.0%)        3.20      (4.0%)    1.5% (  -6% -    9%) 0.250
               AndHighLow      1295.59      (3.3%)     1318.11      (3.4%)    1.7% (  -4% -    8%) 0.105
                  Prefix3        63.21     (15.4%)       64.49     (17.1%)    2.0% ( -26% -   40%) 0.694
               OrHighHigh        35.41      (3.1%)       36.24      (3.1%)    2.4% (  -3% -    8%) 0.015
                   Fuzzy1       253.74      (6.1%)      260.89      (7.1%)    2.8% (  -9% -   16%) 0.175
    BrowseMonthTaxoFacets         3.42      (7.7%)        3.52      (4.1%)    2.9% (  -8% -   15%) 0.135
               AndHighMed       164.48      (2.6%)      169.43      (3.3%)    3.0% (  -2% -    9%) 0.001
                  LowTerm      2645.26      (4.9%)     2752.43      (5.6%)    4.1% (  -6% -   15%) 0.015
            OrHighNotHigh      1286.12      (3.7%)     1349.66      (4.6%)    4.9% (  -3% -   13%) 0.000
               HighPhrase       105.61      (3.7%)      111.65      (4.8%)    5.7% (  -2% -   14%) 0.000
                LowPhrase        35.85      (2.6%)       38.76      (3.3%)    8.1% (   2% -   14%) 0.000
             OrNotHighMed      1241.35      (3.1%)     1368.49      (3.6%)   10.2% (   3% -   17%) 0.000
    HighTermDayOfYearSort       573.92      (9.5%)      687.19      (7.9%)   19.7% (   2% -   40%) 0.000
    BrowseMonthSSDVFacets        11.52      (5.1%)       17.81     (23.5%)   54.6% (  24% -   87%) 0.000
BrowseDayOfYearSSDVFacets        11.24      (3.9%)       18.15     (23.1%)   61.4% (  33% -   91%) 0.000
{noformat}
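
One plausible reading of the split between faceting and sorting tasks, offered 
as a rough sketch and not as the PR's actual code: if every visited document 
forces its whole block to be decoded, the number of decoded values depends on 
how many distinct blocks are touched rather than on how many documents are 
read. The class name, method names and the block size of 128 (the example size 
from the issue description) are assumptions made for illustration only.

```java
import java.util.stream.IntStream;

public class BlockDecodeCost {
    // Hypothetical block size; the issue description mentions 128 as an example.
    static final int BLOCK_SIZE = 128;

    /** Values decoded when each visited doc forces a decode of its whole block. */
    public static long valuesDecodedBlockwise(int[] sortedDocIds) {
        long decoded = 0;
        int lastBlock = -1;
        for (int doc : sortedDocIds) {
            int block = doc / BLOCK_SIZE;
            if (block != lastBlock) {
                // First visit to this block: all of its values get decoded.
                decoded += BLOCK_SIZE;
                lastBlock = block;
            }
        }
        return decoded;
    }

    /** Values decoded when each value can be read independently. */
    public static long valuesDecodedRandomAccess(int[] sortedDocIds) {
        return sortedDocIds.length;
    }

    public static void main(String[] args) {
        int maxDoc = 1 << 20;
        // Dense access, loosely modelling a task that reads every value.
        int[] dense = IntStream.range(0, maxDoc).toArray();
        // Sparse access, loosely modelling a task that reads 1 value in 1000.
        int[] sparse = IntStream.range(0, maxDoc).filter(d -> d % 1000 == 0).toArray();

        System.out.println("dense  blockwise=" + valuesDecodedBlockwise(dense)
            + " randomAccess=" + valuesDecodedRandomAccess(dense));
        System.out.println("sparse blockwise=" + valuesDecodedBlockwise(sparse)
            + " randomAccess=" + valuesDecodedRandomAccess(sparse));
    }
}
```

In this toy model the dense case decodes one value per document either way, 
while the sparse case decodes a full block of 128 values for every document it 
reads. It ignores everything else that affects real query latency.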

> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: 
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where 
> values can be decompressed independently, using DirectWriter/DirectReader. 
> This is a bit inefficient in some cases, e.g. a single outlier can grow the 
> number of bits per value for the entire block, we can't easily use run-length 
> compression, etc. Plus, it encourages using a different sub-class for every 
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with 
> smaller blocks (e.g. 128 values) whose values get all decompressed at once 
> (using SIMD instructions), with skip data within blocks in order to 
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as 
> today's doc values, and as discussed here for postings: 
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
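
The outlier argument in the description above can be illustrated with a small 
sketch. It assumes plain bit-packing of non-negative values where every value 
in a block uses the bit width of the block's largest value, and it ignores 
per-block metadata and any other encoding trick. The class and method names 
are hypothetical and are not Lucene APIs.

```java
public class OutlierBitsPerValue {

    /** Bits needed to represent a non-negative value, with a minimum of 1. */
    public static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    /**
     * Total payload bits when values are split into fixed-size blocks and each
     * block is packed with the bit width of its own maximum value.
     */
    public static long totalBits(long[] values, int blockSize) {
        long total = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int end = Math.min(values.length, start + blockSize);
            long max = 0;
            for (int i = start; i < end; i++) {
                max = Math.max(max, values[i]);
            }
            total += (long) bitsRequired(max) * (end - start);
        }
        return total;
    }

    public static void main(String[] args) {
        // 16k small values (4 bits each) plus one large outlier.
        long[] values = new long[16384];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 16;
        }
        values[5000] = 1L << 40;

        System.out.println("one 16k block:    " + totalBits(values, 16384) + " bits");
        System.out.println("blocks of 128:    " + totalBits(values, 128) + " bits");
    }
}
```

With a single 16k block the outlier sets the width for all 16384 values, 
whereas with blocks of 128 it only widens the one block that contains it.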



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
