[ https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003640#comment-17003640 ]
Adrien Grand commented on LUCENE-4702:
--------------------------------------

I finally explored a different path: JDK 13 added more auto-vectorization optimizations on byte[] arrays, so I wanted to see whether we could leverage them for compression. I ended up with a few lines of code that can encode/decode byte[] arrays with a compression ratio of ~75% when most bytes (there is support for exceptions) fall in the [0x1F,0x3F) or [0x5F,0x7F) ranges, which notably include all digits, lowercase letters, '.', '-' and '_'. So it should be applicable most of the time to terms dictionaries of analyzed content. It already helps on our nightly benchmarks, even though very little normalization is performed there (e.g. no ASCII folding). It is usually faster than LZ4 for short sequences of text like our blocks of suffixes (several times faster on JDK 13+, and a bit faster on earlier JDKs). LZ4's ability to remove duplicate strings is still helpful, but since it hurts multi-term queries I only enable it when it yields a compression ratio below 75%.

I got the following results on a force-merged wikibigall index. Note that the results are not comparable at all with previous results on this issue: this is a different dataset, and there have been many other changes in Lucene that affect these benchmarks, especially the fact that benchmarks now only count 1,000 hits.

{noformat}
              Task QPS baseline StdDev QPS patch StdDev Pct diff
           Respell 164.33 (6.7%) 140.08 (4.3%) -14.8% ( -24% - -4%)
            Fuzzy2 108.19 (7.7%) 101.51 (6.6%) -6.2% ( -19% - 8%)
          Wildcard 94.23 (2.8%) 88.42 (2.6%) -6.2% ( -11% - 0%)
           Prefix3 247.07 (5.1%) 244.95 (4.0%) -0.9% ( -9% - 8%)
      TermBGroup1M 24.38 (6.4%) 24.17 (6.3%) -0.8% ( -12% - 12%)
       TermGroup1M 23.12 (6.6%) 23.02 (6.0%) -0.4% ( -12% - 13%)
       AndHighHigh 35.88 (4.8%) 35.78 (5.0%) -0.3% ( -9% - 9%)
      TermGroup10K 45.63 (5.7%) 45.53 (5.4%) -0.2% ( -10% - 11%)
          SpanNear 10.89 (1.4%) 10.87 (1.5%) -0.2% ( -3% - 2%)
      SloppyPhrase 19.57 (4.1%) 19.54 (4.1%) -0.1% ( -8% - 8%)
            Phrase 69.13 (3.5%) 69.05 (3.9%) -0.1% ( -7% - 7%)
        AndHighMed 50.75 (4.6%) 50.70 (4.6%) -0.1% ( -8% - 9%)
  IntervalsOrdered 23.97 (0.8%) 23.96 (0.6%) -0.0% ( -1% - 1%)
              Term 1432.69 (3.8%) 1432.25 (3.7%) -0.0% ( -7% - 7%)
   AndHighOrMedMed 37.71 (1.7%) 37.72 (1.7%) 0.0% ( -3% - 3%)
    TermBGroup1M1P 25.61 (3.4%) 25.62 (3.4%) 0.1% ( -6% - 7%)
        TermDTSort 41.04 (4.9%) 41.06 (4.6%) 0.1% ( -9% - 10%)
         OrHighMed 35.05 (3.2%) 35.08 (3.4%) 0.1% ( -6% - 6%)
  AndMedOrHighHigh 34.22 (3.5%) 34.26 (3.7%) 0.1% ( -6% - 7%)
 TermDayOfYearSort 93.34 (7.6%) 93.60 (7.2%) 0.3% ( -13% - 16%)
      TermGroup100 15.21 (3.1%) 15.27 (3.0%) 0.4% ( -5% - 6%)
     TermMonthSort 49.27 (2.7%) 49.53 (2.3%) 0.5% ( -4% - 5%)
     TermTitleSort 127.41 (2.8%) 128.12 (2.2%) 0.6% ( -4% - 5%)
        OrHighHigh 10.14 (3.3%) 10.20 (3.5%) 0.6% ( -5% - 7%)
            Fuzzy1 159.76 (8.2%) 161.68 (6.6%) 1.2% ( -12% - 17%)
            IntNRQ 266.89 (8.8%) 280.44 (11.6%) 5.1% ( -14% - 27%)
{noformat}

The hit on {{Respell}} is significant, but on the other multi-term queries it looks reasonable to me. It gave a ~9.3% reduction of the {{tim}} file, from 937MB to 850MB.
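To make the ~75% figure concrete, here is a minimal sketch of the packing arithmetic only: each byte from the two 32-value ranges maps to a 6-bit code, and four 6-bit codes fit into three output bytes. This is not the patch itself (the real encoder also records out-of-range bytes as exceptions and is shaped so the JDK 13+ byte[] auto-vectorization mentioned above can kick in); the {{SixBitPackingSketch}} class and its method names are made up for this example.

{noformat}
// Illustrative sketch only: map bytes from [0x1F,0x3F) and [0x5F,0x7F) to
// 6-bit codes and pack four codes into three bytes (~75% of the original size).
// Exception handling for out-of-range bytes is deliberately omitted.
class SixBitPackingSketch {

  // Map an in-range byte to a 6-bit code in [0,64).
  static int code(byte b) {
    if (b >= 0x1F && b < 0x3F) {
      return b - 0x1F;          // 0..31: digits, '.', '-', ...
    } else if (b >= 0x5F && b < 0x7F) {
      return 32 + (b - 0x5F);   // 32..63: '_', lowercase letters, ...
    } else {
      throw new IllegalArgumentException("out of range, handled as an exception in the real encoder");
    }
  }

  // Inverse of code().
  static byte decode(int code) {
    return (byte) (code < 32 ? code + 0x1F : (code - 32) + 0x5F);
  }

  // Encode 4 input bytes into 3 output bytes.
  static void encodeGroup(byte[] in, int inOff, byte[] out, int outOff) {
    int packed = (code(in[inOff]) << 18)
        | (code(in[inOff + 1]) << 12)
        | (code(in[inOff + 2]) << 6)
        | code(in[inOff + 3]);
    out[outOff] = (byte) (packed >>> 16);
    out[outOff + 1] = (byte) (packed >>> 8);
    out[outOff + 2] = (byte) packed;
  }

  // Decode 3 packed bytes back into the 4 original bytes.
  static void decodeGroup(byte[] in, int inOff, byte[] out, int outOff) {
    int packed = ((in[inOff] & 0xFF) << 16)
        | ((in[inOff + 1] & 0xFF) << 8)
        | (in[inOff + 2] & 0xFF);
    out[outOff] = decode((packed >>> 18) & 0x3F);
    out[outOff + 1] = decode((packed >>> 12) & 0x3F);
    out[outOff + 2] = decode((packed >>> 6) & 0x3F);
    out[outOff + 3] = decode(packed & 0x3F);
  }
}
{noformat}

Judging from the per-algorithm counters in the stats below, each block of suffixes ends up either compressed with this lowercase-ASCII scheme, compressed with LZ4 (only when LZ4 yields a ratio below 75%, to limit the impact on multi-term queries), or stored uncompressed.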
Here are the detailed stats for the "body" field:

{noformat}
index FST:
  72 bytes
terms:
  46916528 terms
  595069147 bytes (12.7 bytes/term)
blocks:
  1507239 blocks
  1158537 terms-only blocks
  471 sub-block-only blocks
  348231 mixed blocks
  318391 floor blocks
  491775 non-floor blocks
  1015464 floor sub-blocks
  359890173 term suffix bytes before compression (196.4 suffix-bytes/block)
  296029380 compressed term suffix bytes (0.82 compression ratio - compression count by algorithm: uncompressed:225133, lowercase_ascii:1217151, LZ4:64955)
  94426201 term stats bytes (62.6 stats-bytes/block)
  236025336 other bytes (156.6 other-bytes/block)
by prefix length:
  0: 4
  1: 403
  2: 12500
  3: 135458
  4: 214723
  5: 445741
  6: 279299
  7: 120403
  8: 95046
  9: 65611
  10: 42914
  11: 25225
  12: 15910
  13: 8865
  14: 9029
  15: 13485
  16: 10549
  17: 3412
  18: 1234
  19: 1003
  20: 1197
  21: 753
  22: 436
  23: 510
  24: 328
  25: 494
  26: 396
  27: 723
  28: 246
  29: 310
  30: 103
  31: 60
  32: 58
  33: 36
  34: 61
  35: 83
  36: 118
  37: 44
  38: 48
  39: 81
  40: 16
  41: 29
  42: 12
  43: 12
  44: 44
  45: 16
  46: 54
  47: 18
  48: 10
  49: 5
  50: 6
  51: 2
  52: 4
  53: 13
  55: 2
  56: 11
  57: 6
  58: 7
  59: 8
  60: 2
  61: 11
  62: 8
  63: 8
  64: 4
  65: 5
  66: 7
  67: 4
  68: 1
  69: 1
  70: 4
  73: 2
  74: 1
  76: 1
  77: 1
  78: 2
  79: 2
  81: 1
{noformat}

When I simulate 1M flake IDs with a 1,000 docs/s indexing rate, I get the following stats:

{noformat}
index FST:
  134007 bytes
terms:
  1000000 terms
  16000000 bytes (16.0 bytes/term)
blocks:
  39215 blocks
  39062 terms-only blocks
  153 sub-block-only blocks
  0 mixed blocks
  3923 floor blocks
  1 non-floor blocks
  39214 floor sub-blocks
  10019627 term suffix bytes before compression (165.6 suffix-bytes/block)
  6492123 compressed term suffix bytes (0.65 compression ratio - compression count by algorithm: uncompressed:137, lowercase_ascii:15, LZ4:39063)
  1000000 term stats bytes (25.5 stats-bytes/block)
  4101135 other bytes (104.6 other-bytes/block)
by prefix length:
  0: 1
  6: 152
  7: 39062
{noformat}

> Terms dictionary compression
> ----------------------------
>
>                 Key: LUCENE-4702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>         Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>
> I've done a quick test with the block tree terms dictionary by replacing a call to
> IndexOutput.writeBytes to write suffix bytes with a call to LZ4.compressHC to test the
> performance hit. Interestingly, search performance was very good (see comparison table
> below) and the tim files were 14% smaller (from 150432 bytes overall to 129516).
> {noformat}
>              Task QPS baseline StdDev QPS compressed StdDev Pct diff
>            Fuzzy1 111.50 (2.0%) 78.78 (1.5%) -29.4% ( -32% - -26%)
>            Fuzzy2 36.99 (2.7%) 28.59 (1.5%) -22.7% ( -26% - -18%)
>           Respell 122.86 (2.1%) 103.89 (1.7%) -15.4% ( -18% - -11%)
>          Wildcard 100.58 (4.3%) 94.42 (3.2%) -6.1% ( -13% - 1%)
>           Prefix3 124.90 (5.7%) 122.67 (4.7%) -1.8% ( -11% - 9%)
>         OrHighLow 169.87 (6.8%) 167.77 (8.0%) -1.2% ( -15% - 14%)
>           LowTerm 1949.85 (4.5%) 1929.02 (3.4%) -1.1% ( -8% - 7%)
>        AndHighLow 2011.95 (3.5%) 1991.85 (3.3%) -1.0% ( -7% - 5%)
>        OrHighHigh 155.63 (6.7%) 154.12 (7.9%) -1.0% ( -14% - 14%)
>       AndHighHigh 341.82 (1.2%) 339.49 (1.7%) -0.7% ( -3% - 2%)
>         OrHighMed 217.55 (6.3%) 216.16 (7.1%) -0.6% ( -13% - 13%)
>            IntNRQ 53.10 (10.9%) 52.90 (8.6%) -0.4% ( -17% - 21%)
>           MedTerm 998.11 (3.8%) 994.82 (5.6%) -0.3% ( -9% - 9%)
>       MedSpanNear 60.50 (3.7%) 60.36 (4.8%) -0.2% ( -8% - 8%)
>      HighSpanNear 19.74 (4.5%) 19.72 (5.1%) -0.1% ( -9% - 9%)
>       LowSpanNear 101.93 (3.2%) 101.82 (4.4%) -0.1% ( -7% - 7%)
>        AndHighMed 366.18 (1.7%) 366.93 (1.7%) 0.2% ( -3% - 3%)
>          PKLookup 237.28 (4.0%) 237.96 (4.2%) 0.3% ( -7% - 8%)
>         MedPhrase 173.17 (4.7%) 174.69 (4.7%) 0.9% ( -8% - 10%)
>   LowSloppyPhrase 180.91 (2.6%) 182.79 (2.7%) 1.0% ( -4% - 6%)
>         LowPhrase 374.64 (5.5%) 379.11 (5.8%) 1.2% ( -9% - 13%)
>          HighTerm 253.14 (7.9%) 256.97 (11.4%) 1.5% ( -16% - 22%)
>        HighPhrase 19.52 (10.6%) 19.83 (11.0%) 1.6% ( -18% - 25%)
>   MedSloppyPhrase 141.90 (2.6%) 144.11 (2.5%) 1.6% ( -3% - 6%)
>  HighSloppyPhrase 25.26 (4.8%) 25.97 (5.0%) 2.8% ( -6% - 13%)
> {noformat}
>
> Only queries which are very terms-dictionary-intensive got a performance hit
> (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved
> (surprisingly) well.
>
> Do you think of it as something worth exploring?
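For context on the quick test described in the quoted report, here is a rough sketch of the idea of routing a block of suffix bytes through LZ4 instead of writing it raw. The {{SuffixCompressionSketch}} class and its {{writeSuffixBytes}} method are invented for illustration, and the sketch uses the {{org.apache.lucene.util.compress.LZ4}} helper from recent Lucene versions rather than the older {{LZ4.compressHC}} mentioned above; the exact helper names and signatures differ across Lucene versions.

{noformat}
// Illustration only, not the original experiment's code: compress a block of
// suffix bytes with LZ4 (high-compression mode) before writing it, instead of
// writing the raw bytes with writeBytes.
import java.io.IOException;

import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.compress.LZ4;

class SuffixCompressionSketch {

  // Reused across blocks; the high-compression table trades speed for ratio.
  private final LZ4.HighCompressionHashTable ht = new LZ4.HighCompressionHashTable();

  void writeSuffixBytes(DataOutput out, byte[] suffixBytes, int len) throws IOException {
    // Before: out.writeBytes(suffixBytes, 0, len);
    out.writeVInt(len);                         // original length, needed at read time
    LZ4.compress(suffixBytes, 0, len, out, ht); // compressed suffix bytes
  }
}
{noformat}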