[ https://issues.apache.org/jira/browse/LUCENE-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180569#comment-17180569 ]
Adrien Grand commented on LUCENE-9447:
--------------------------------------

I opened a pull request with this new approach. There are a lot of changed lines due to the ceremony of introducing a new codec/format and moving the old one to backward-codecs, but you can really focus on the new Lucene87StoredFieldsFormat class to see what the patch does: it splits stored fields into blocks made of an 8kB dictionary, which is compressed on its own, plus 10 sub-blocks of 48kB each that are compressed using that dictionary as a preset dictionary (so 488kB in total). A sketch of the preset-dictionary idea is included at the end of this message.

Here is how it compares with the current codec for BEST_COMPRESSION on the JSON logs I used for testing in previous comments and on a subset of enwiki docs:

||Dataset||Method||Index size (MB)||Index time (s)||Avg fetch time (us)||
|JSON logs|Lucene50StoredFieldsFormat/BEST_COMPRESSION|100.6|14|35|
|JSON logs|Lucene87StoredFieldsFormat/BEST_COMPRESSION|64.9|14|41|
|enwiki|Lucene50StoredFieldsFormat/BEST_COMPRESSION|338.1|34|237|
|enwiki|Lucene87StoredFieldsFormat/BEST_COMPRESSION|338.0|35|250|

In short, it doesn't make things worse on text data like enwiki, but it makes the compression ratio much better on highly compressible data while preserving similar indexing and fetching times. Similar approaches might be interesting to look into for BEST_SPEED (preset dictionaries would be easy to support with LZ4), but I'd rather explore that separately.

> Make BEST_COMPRESSION compress more aggressively?
> -------------------------------------------------
>
>                 Key: LUCENE-9447
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9447
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The Lucene86 codec supports setting a "Mode" for stored fields compression: either "BEST_SPEED", which translates to blocks of 16kB or 128 documents (whichever is hit first) compressed with LZ4, or "BEST_COMPRESSION", which translates to blocks of 60kB or 512 documents compressed with DEFLATE at the default compression level (6). A snippet showing how to select this mode is sketched at the end of this message.
> After recently looking at indices that spend most of their disk space on stored fields, I noticed that there was quite some room for improvement by increasing the block size even further:
> ||Block size||Stored fields size (bytes)||
> |60kB|168412338|
> |128kB|130813639|
> |256kB|113587009|
> |512kB|104776378|
> |1MB|100367095|
> |2MB|98152464|
> |4MB|97034425|
> |8MB|96478746|
> For this specific dataset, I had 1M documents that each had about 2kB of stored fields and quite some redundancy.
> This makes me want to look into bumping this block size to maybe 256kB. It would be interesting to re-do the experiments we did on LUCENE-6100 to see how this affects merging speed. That said, I don't think it would be terrible if merging time increased a bit, given that we already offer the BEST_SPEED option for CPU-conscious users.
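For illustration, here is a minimal, hypothetical sketch of the preset-dictionary idea described in the comment above, using the standard java.util.zip API with raw DEFLATE streams. This is not the actual Lucene87StoredFieldsFormat code (see the pull request for that); the class name, sizes wiring, and sample data are made up for the example:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictSketch {

  // Sizes from the description above: one shared 8kB dictionary and
  // ten 48kB sub-blocks per block, i.e. 488kB of data per block in total.
  static final int DICT_SIZE = 8 * 1024;
  static final int SUB_BLOCK_SIZE = 48 * 1024;

  /** Compresses one sub-block with raw DEFLATE, seeded with the shared dictionary. */
  static byte[] compress(byte[] dict, byte[] subBlock) {
    // true = raw deflate stream, no zlib header
    Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
    try {
      deflater.setDictionary(dict); // preset dictionary shared by all sub-blocks of this block
      deflater.setInput(subBlock);
      deflater.finish();
      byte[] out = new byte[64];
      int len = 0;
      while (!deflater.finished()) {
        if (len == out.length) {
          out = Arrays.copyOf(out, out.length * 2); // grow for incompressible input
        }
        len += deflater.deflate(out, len, out.length - len);
      }
      return Arrays.copyOf(out, len);
    } finally {
      deflater.end();
    }
  }

  /** Decompresses one sub-block given only the dictionary and the original length. */
  static byte[] decompress(byte[] dict, byte[] compressed, int originalLength)
      throws DataFormatException {
    // In raw mode there is no header to signal the dictionary, so set it up front.
    Inflater inflater = new Inflater(true);
    try {
      inflater.setDictionary(dict);
      inflater.setInput(compressed);
      byte[] out = new byte[originalLength];
      int len = 0;
      while (len < originalLength && !inflater.finished()) {
        len += inflater.inflate(out, len, originalLength - len);
      }
      return out;
    } finally {
      inflater.end();
    }
  }

  public static void main(String[] args) throws DataFormatException {
    // Hypothetical, highly redundant "stored fields" data.
    byte[] dict = "{\"timestamp\":\"2020-08-19\",\"level\":\"INFO\",\"message\":"
        .getBytes(StandardCharsets.UTF_8);
    byte[] subBlock = "{\"timestamp\":\"2020-08-19\",\"level\":\"INFO\",\"message\":\"hello\"}"
        .getBytes(StandardCharsets.UTF_8);
    byte[] compressed = compress(dict, subBlock);
    byte[] restored = decompress(dict, compressed, subBlock.length);
    System.out.println(subBlock.length + " -> " + compressed.length
        + " bytes, round-trip ok: " + Arrays.equals(subBlock, restored));
  }
}
{code}

The appeal of this layout is that each sub-block depends only on the shared dictionary, so fetching a document should only require decompressing the dictionary plus the one sub-block that contains it, rather than the whole 488kB block, while cross-document redundancy is still captured by the dictionary.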
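And here is a short sketch of selecting the stored fields "Mode" mentioned in the quoted issue description, assuming the Lucene 8.6.x API where the codec constructor takes the mode directly (the index path and analyzer are placeholders):

{code:java}
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat;
import org.apache.lucene.codecs.lucene86.Lucene86Codec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class StoredFieldsModeExample {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    // Trade some indexing/fetching CPU for a smaller index; BEST_SPEED is the default.
    config.setCodec(new Lucene86Codec(Lucene50StoredFieldsFormat.Mode.BEST_COMPRESSION));
    try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/index")), config)) {
      // ... add documents ...
    }
  }
}
{code}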