[jira] [Commented] (LUCENE-4702) Terms dictionary compression

Adrien Grand (Jira) Mon, 27 Jan 2020 09:50:16 -0800


    [ https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024532#comment-17024532 ]


Adrien Grand commented on LUCENE-4702:
--------------------------------------

OK, I benchmarked with multi-segment indices this time to better replicate 
the nightly benchmarks. I opened a pull request at 
https://github.com/apache/lucene-solr/pull/1216 that:
 - removes compression of suffix lengths, since it didn't help much anyway,
 - replaces LZ4 on stats with explicit run-length compression,
 - only tries LZ4 on suffix bytes if the average suffix length is > 6, to 
reduce index-time overhead, since compression is unlikely to meet the savings 
expectations otherwise.
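
The last two points can be illustrated with a minimal, self-contained Java 
sketch. It is not the code from the pull request: the class name, method names 
and constant are hypothetical, and it assumes the stats are per-term integers 
such as document frequencies, where long runs of identical values (for example 
many terms with a frequency of 1) are plausible. Only the "> 6" threshold comes 
from the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; this is not the code from the pull request. */
final class TermsDictSketch {

  // Threshold quoted in the comment above; the constant name is hypothetical.
  static final int MIN_AVG_SUFFIX_LENGTH_FOR_LZ4 = 6;

  /** True if the block's average suffix length exceeds the threshold. */
  static boolean shouldTryLz4(long totalSuffixBytes, int numEntries) {
    // Compare totals instead of dividing, so the average is not truncated.
    return numEntries > 0
        && totalSuffixBytes > (long) MIN_AVG_SUFFIX_LENGTH_FOR_LZ4 * numEntries;
  }

  /** Run-length encodes values as {value, runLength} pairs. */
  static List<int[]> runLengthEncode(int[] values) {
    List<int[]> runs = new ArrayList<>();
    for (int value : values) {
      int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
      if (last != null && last[0] == value) {
        last[1]++; // extend the current run
      } else {
        runs.add(new int[] {value, 1}); // start a new run
      }
    }
    return runs;
  }

  /** Inverse of runLengthEncode. */
  static int[] runLengthDecode(List<int[]> runs) {
    int total = 0;
    for (int[] run : runs) {
      total += run[1];
    }
    int[] values = new int[total];
    int pos = 0;
    for (int[] run : runs) {
      Arrays.fill(values, pos, pos + run[1], run[0]);
      pos += run[1];
    }
    return values;
  }

  public static void main(String[] args) {
    // Assumed example data: document frequencies of consecutive terms.
    int[] docFreqs = {1, 1, 1, 1, 7, 1, 1, 3, 3};
    List<int[]> runs = runLengthEncode(docFreqs);
    System.out.println(runs.size() + " runs for " + docFreqs.length + " values");
    System.out.println(shouldTryLz4(900, 100)); // average 9 > 6: true
    System.out.println(shouldTryLz4(400, 100)); // average 4 <= 6: false
  }
}
```

Gating on the average suffix length means blocks of short suffixes skip the 
compression attempt entirely, which is where the index-time saving would come 
from in this sketch.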

On wikibigall, the specialized RLE makes the tim file even smaller with this 
change (969MB vs. 996MB) and luceneutil seems to be a bit happier:

{noformat}
                    Task QPS baseline      StdDev   QPS patch      StdDev        Pct diff
                  IntNRQ       144.16      (1.2%)      143.47      (1.9%)   -0.5% (  -3% -    2%)
            TermBGroup1M        32.04      (5.1%)       31.93      (5.1%)   -0.4% ( -10% -   10%)
              TermDTSort        39.13      (0.9%)       39.05      (1.0%)   -0.2% (  -2% -    1%)
             TermGroup1M        40.18      (4.0%)       40.12      (3.4%)   -0.2% (  -7% -    7%)
           TermTitleSort       124.62      (1.9%)      124.54      (1.6%)   -0.1% (  -3% -    3%)
       TermDayOfYearSort        88.37      (6.9%)       88.34      (7.1%)   -0.0% ( -13% -   14%)
            TermGroup10K        28.56      (5.0%)       28.56      (4.4%)    0.0% (  -8% -    9%)
        IntervalsOrdered         4.50      (1.1%)        4.51      (0.6%)    0.0% (  -1% -    1%)
          TermBGroup1M1P        45.83      (4.1%)       45.85      (4.0%)    0.0% (  -7% -    8%)
           TermMonthSort       137.33      (1.8%)      137.40      (1.3%)    0.1% (  -2% -    3%)
             AndHighHigh        72.97      (2.8%)       73.05      (2.7%)    0.1% (  -5% -    5%)
               OrHighMed        77.75      (2.7%)       77.85      (2.7%)    0.1% (  -5% -    5%)
                SpanNear        10.66      (1.2%)       10.68      (1.2%)    0.2% (  -2% -    2%)
                  Phrase        59.75      (4.9%)       59.91      (5.2%)    0.3% (  -9% -   10%)
                    Term      1358.87      (6.8%)     1363.02      (6.1%)    0.3% ( -11% -   14%)
        AndMedOrHighHigh        28.18      (3.0%)       28.27      (2.5%)    0.3% (  -5% -    6%)
              OrHighHigh        18.55      (3.2%)       18.61      (2.2%)    0.3% (  -4% -    5%)
            SloppyPhrase        19.41      (3.9%)       19.49      (3.5%)    0.4% (  -6% -    8%)
              AndHighMed        65.81      (2.8%)       66.15      (2.4%)    0.5% (  -4% -    5%)
         AndHighOrMedMed        36.49      (2.5%)       36.69      (1.9%)    0.5% (  -3% -    5%)
            TermGroup100        12.19      (3.9%)       12.27      (4.0%)    0.6% (  -7% -    8%)
                PKLookup       217.61      (3.2%)      220.39      (3.3%)    1.3% (  -5% -    8%)
                 Prefix3       197.95      (3.3%)      202.32      (3.4%)    2.2% (  -4% -    9%)
                Wildcard        37.78      (2.2%)       41.43      (2.8%)    9.6% (   4% -   14%)
                  Fuzzy1        47.77      (5.5%)       53.35      (8.4%)   11.7% (  -2% -   27%)
                  Fuzzy2        43.69      (7.5%)       49.50     (10.7%)   13.3% (  -4% -   34%)
                 Respell        34.05      (1.6%)       41.94      (1.4%)   23.2% (  19% -   26%)
{noformat}

I plan to commit it and see how that affects the nightly benchmarks.

> Terms dictionary compression
> ----------------------------
>
>                 Key: LUCENE-4702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>         Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I've done a quick test with the block tree terms dictionary by replacing a 
> call to IndexOutput.writeBytes to write suffix bytes with a call to 
> LZ4.compressHC to test the performance hit. Interestingly, search performance 
> was very good (see comparison table below) and the tim files were 14% smaller 
> (from 150432 bytes overall to 129516).
> {noformat}
>                     Task QPS baseline      StdDev QPS compressed      StdDev        Pct diff
>                   Fuzzy1       111.50      (2.0%)          78.78      (1.5%)  -29.4% ( -32% -  -26%)
>                   Fuzzy2        36.99      (2.7%)          28.59      (1.5%)  -22.7% ( -26% -  -18%)
>                  Respell       122.86      (2.1%)         103.89      (1.7%)  -15.4% ( -18% -  -11%)
>                 Wildcard       100.58      (4.3%)          94.42      (3.2%)   -6.1% ( -13% -    1%)
>                  Prefix3       124.90      (5.7%)         122.67      (4.7%)   -1.8% ( -11% -    9%)
>                OrHighLow       169.87      (6.8%)         167.77      (8.0%)   -1.2% ( -15% -   14%)
>                  LowTerm      1949.85      (4.5%)        1929.02      (3.4%)   -1.1% (  -8% -    7%)
>               AndHighLow      2011.95      (3.5%)        1991.85      (3.3%)   -1.0% (  -7% -    5%)
>               OrHighHigh       155.63      (6.7%)         154.12      (7.9%)   -1.0% ( -14% -   14%)
>              AndHighHigh       341.82      (1.2%)         339.49      (1.7%)   -0.7% (  -3% -    2%)
>                OrHighMed       217.55      (6.3%)         216.16      (7.1%)   -0.6% ( -13% -   13%)
>                   IntNRQ        53.10     (10.9%)          52.90      (8.6%)   -0.4% ( -17% -   21%)
>                  MedTerm       998.11      (3.8%)         994.82      (5.6%)   -0.3% (  -9% -    9%)
>              MedSpanNear        60.50      (3.7%)          60.36      (4.8%)   -0.2% (  -8% -    8%)
>             HighSpanNear        19.74      (4.5%)          19.72      (5.1%)   -0.1% (  -9% -    9%)
>              LowSpanNear       101.93      (3.2%)         101.82      (4.4%)   -0.1% (  -7% -    7%)
>               AndHighMed       366.18      (1.7%)         366.93      (1.7%)    0.2% (  -3% -    3%)
>                 PKLookup       237.28      (4.0%)         237.96      (4.2%)    0.3% (  -7% -    8%)
>                MedPhrase       173.17      (4.7%)         174.69      (4.7%)    0.9% (  -8% -   10%)
>          LowSloppyPhrase       180.91      (2.6%)         182.79      (2.7%)    1.0% (  -4% -    6%)
>                LowPhrase       374.64      (5.5%)         379.11      (5.8%)    1.2% (  -9% -   13%)
>                 HighTerm       253.14      (7.9%)         256.97     (11.4%)    1.5% ( -16% -   22%)
>               HighPhrase        19.52     (10.6%)          19.83     (11.0%)    1.6% ( -18% -   25%)
>          MedSloppyPhrase       141.90      (2.6%)         144.11      (2.5%)    1.6% (  -3% -    6%)
>         HighSloppyPhrase        25.26      (4.8%)          25.97      (5.0%)    2.8% (  -6% -   13%)
> {noformat}
> Only queries that are very terms-dictionary-intensive took a performance hit 
> (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, 
> behaved (surprisingly) well.
> Do you think this is worth exploring?



