[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

weizijun (Jira) Fri, 27 Aug 2021 01:13:06 -0700


    [ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405670#comment-17405670 ]


weizijun commented on LUCENE-10033:
-----------------------------------

Hi [~gsmiller], here are the wikimedium10m results:

{noformat}
                     Task  QPS baseline      StdDev  QPS my_modified_version      StdDev  Pct diff  p-value
    BrowseMonthSSDVFacets       12.05     (13.9%)        5.32      (2.0%)  -55.9% ( -63% -  -46%) 0.000
BrowseDayOfYearSSDVFacets       10.80     (13.1%)        5.10      (2.5%)  -52.8% ( -60% -  -42%) 0.000
               TermDTSort      111.11     (13.4%)      109.05     (10.5%)   -1.9% ( -22% -   25%) 0.625
                 HighTerm      927.28      (4.1%)      913.16      (3.0%)   -1.5% (  -8% -    5%) 0.184
                  MedTerm     1043.87      (5.7%)     1029.65      (3.5%)   -1.4% (  -9% -    8%) 0.361
                 Wildcard      248.11      (2.4%)      244.81      (3.2%)   -1.3% (  -6% -    4%) 0.136
             OrNotHighMed      514.00      (2.7%)      508.38      (2.6%)   -1.1% (  -6% -    4%) 0.188
                  LowTerm     1230.06      (4.3%)     1219.21      (3.4%)   -0.9% (  -8% -    7%) 0.475
              AndHighHigh       52.82      (4.6%)       52.36      (3.8%)   -0.9% (  -8% -    7%) 0.515
               HighPhrase      117.84      (2.9%)      117.33      (1.7%)   -0.4% (  -4% -    4%) 0.558
                MedPhrase       71.85      (2.7%)       71.55      (1.9%)   -0.4% (  -4% -    4%) 0.568
             OrHighNotMed      504.15      (4.5%)      502.33      (3.0%)   -0.4% (  -7% -    7%) 0.764
        HighTermMonthSort      138.89      (9.3%)      138.40     (11.8%)   -0.4% ( -19% -   22%) 0.916
                  Prefix3      184.76      (3.5%)      184.20      (2.7%)   -0.3% (  -6% -    6%) 0.757
                   IntNRQ       87.44      (0.8%)       87.25      (0.8%)   -0.2% (  -1% -    1%) 0.394
               AndHighMed      154.81      (3.1%)      154.48      (2.5%)   -0.2% (  -5% -    5%) 0.816
BrowseDayOfYearTaxoFacets        2.35      (4.2%)        2.35      (3.9%)   -0.1% (  -7% -    8%) 0.911
               AndHighLow      379.69      (3.7%)      379.19      (3.7%)   -0.1% (  -7% -    7%) 0.911
    BrowseMonthTaxoFacets        2.49      (4.6%)        2.49      (4.3%)   -0.1% (  -8% -    9%) 0.928
     BrowseDateTaxoFacets        2.35      (4.3%)        2.35      (3.9%)   -0.1% (  -7% -    8%) 0.960
               OrHighHigh       18.57      (2.5%)       18.56      (1.8%)   -0.1% (  -4% -    4%) 0.932
      MedIntervalsOrdered       48.37      (4.0%)       48.36      (4.0%)   -0.0% (  -7% -    8%) 0.987
     HighTermTitleBDVSort       91.07     (10.3%)       91.13     (11.8%)    0.1% ( -20% -   24%) 0.985
         HighSloppyPhrase       27.39      (4.5%)       27.42      (3.2%)    0.1% (  -7% -    8%) 0.931
     HighIntervalsOrdered       20.94      (3.6%)       20.96      (2.8%)    0.1% (  -6% -    6%) 0.907
            OrHighNotHigh      431.17      (3.5%)      431.76      (2.7%)    0.1% (  -5% -    6%) 0.889
          MedSloppyPhrase       16.30      (4.7%)       16.33      (3.3%)    0.2% (  -7% -    8%) 0.876
      LowIntervalsOrdered      179.07      (3.4%)      179.65      (2.5%)    0.3% (  -5% -    6%) 0.734
                LowPhrase      278.39      (2.6%)      279.34      (2.6%)    0.3% (  -4% -    5%) 0.674
            OrNotHighHigh      421.04      (4.1%)      422.68      (4.1%)    0.4% (  -7% -    8%) 0.762
             HighSpanNear       10.97      (2.6%)       11.01      (2.7%)    0.4% (  -4% -    5%) 0.621
              LowSpanNear       32.07      (1.9%)       32.21      (2.0%)    0.4% (  -3% -    4%) 0.490
                   Fuzzy1       51.86      (7.4%)       52.12      (7.3%)    0.5% ( -13% -   16%) 0.834
                OrHighMed      103.63      (2.5%)      104.13      (1.7%)    0.5% (  -3% -    4%) 0.473
          LowSloppyPhrase       93.59      (3.3%)       94.13      (2.4%)    0.6% (  -4% -    6%) 0.518
             OrNotHighLow      413.02      (3.6%)      415.65      (3.8%)    0.6% (  -6% -    8%) 0.585
             OrHighNotLow      514.45      (2.8%)      517.93      (3.7%)    0.7% (  -5% -    7%) 0.516
                  Respell       50.34      (2.4%)       50.74      (2.3%)    0.8% (  -3% -    5%) 0.281
              MedSpanNear        9.20      (4.9%)        9.29      (4.8%)    1.0% (  -8% -   11%) 0.535
                OrHighLow      257.35      (4.2%)      260.38      (3.4%)    1.2% (  -6% -    9%) 0.325
                   Fuzzy2       46.61     (10.2%)       47.26      (8.8%)    1.4% ( -15% -   22%) 0.642
                 PKLookup      140.43      (2.9%)      142.41      (2.4%)    1.4% (  -3% -    6%) 0.096
    HighTermDayOfYearSort      115.09     (12.8%)      116.98     (13.2%)    1.6% ( -21% -   31%) 0.689
{noformat}

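The two SSDV facet tasks are much slower with the change (BrowseMonthSSDVFacets -55.9%, BrowseDayOfYearSSDVFacets -52.8%, both with p-value 0.000); the other tasks show no significant difference.
The full results are in the attachment: [^benchmark-10m]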

> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: benchmark, benchmark-10m
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: 
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where 
> values can be decompressed independently, using DirectWriter/DirectReader. 
> This is a bit inefficient in some cases, e.g. a single outlier can grow the 
> number of bits per value for the entire block, we can't easily use run-length 
> compression, etc. Plus, it encourages using a different sub-class for every 
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with 
> smaller blocks (e.g. 128 values) whose values get all decompressed at once 
> (using SIMD instructions), with skip data within blocks in order to 
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as 
> today's doc values, and as discussed here for postings: 
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
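
The outlier problem mentioned in the description can be illustrated with a small sketch. This is not Lucene code: it assumes plain bit-packing where each block is stored at the bit width of its largest value, and it ignores per-block metadata, skip data, and anything DirectWriter does beyond that (for example its restricted set of supported widths). The helper names `bits_required` and `packed_bits` are made up for this example.

```python
def bits_required(max_value: int) -> int:
    """Bits needed to represent any value in [0, max_value] (at least 1)."""
    return max(1, max_value.bit_length())


def packed_bits(values, block_size):
    """Total payload bits when every block is packed at the width of its largest value."""
    total = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        total += bits_required(max(block)) * len(block)
    return total


# 16384 small values that fit in 4 bits, plus a single outlier that needs 40 bits.
values = [i % 16 for i in range(16384)]
values[5000] = 1 << 39

large = packed_bits(values, 16384)  # one 16k block: every value pays 40 bits
small = packed_bits(values, 128)    # 128-value blocks: only one block pays 40 bits

print("one 16k block :", large, "bits")
print("128-value blocks:", small, "bits")
```

Under these assumptions the outlier inflates only the one small block that contains it, instead of all 16384 values. A real format would give some of that saving back to per-block headers.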


