[ https://issues.apache.org/jira/browse/LUCENE-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176411#comment-17176411 ]

Adrien Grand commented on LUCENE-9447:
--------------------------------------

I collected some data using various block sizes and a preset dictionary that 
consists of the first 32kB of the dataset, like on LUCENE-6100. The dataset 
consists of 1M highly compressible documents whose total uncompressed size is 
2.2GB.

||Block size||Preset dict||Stored fields size||Index time||Merge time||
|BEST_SPEED (LZ4, 16kB)|No|304MB|9s|200ms|
|60kB|No|100.6MB|14s|70ms|
|60kB|Yes|64.5MB|17s|50ms|
|256kB|No|63.8MB|16.5s|40ms|
|256kB|Yes|54.7MB|17.5s|40ms|
|1MB|No|54.7MB|16s|32ms|
|1MB|Yes|52.3MB|16.5s|32ms|

Merging is always fast because in this case we copy the compressed data 
directly. It looks like the relative inefficiency of the preset dictionary 
decreases as the block size increases. However, I still worry that preset 
dictionaries would be challenging to integrate while preserving the ability to 
copy compressed data on merge, since we can only do that when the source and 
destination blocks use the same dictionary.
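
To make that constraint concrete, here is a minimal, self-contained sketch 
using plain java.util.zip (not the actual stored fields code): inflating a 
block that was compressed against a preset dictionary only works if the exact 
same dictionary is supplied again. The dictionary and document contents below 
are hypothetical stand-ins.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictDemo {
  public static void main(String[] args) throws DataFormatException {
    // Hypothetical stand-ins: the dictionary is shared boilerplate and the
    // document repeats much of it, as in a highly compressible dataset.
    byte[] dict = "common header repeated across documents: ".getBytes(StandardCharsets.UTF_8);
    byte[] doc = "common header repeated across documents: unique payload 42".getBytes(StandardCharsets.UTF_8);

    // Compress with a preset dictionary (raw deflate, no zlib wrapper).
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true);
    deflater.setDictionary(dict); // must be set before any input is supplied
    deflater.setInput(doc);
    deflater.finish();
    byte[] compressed = new byte[1024];
    int clen = deflater.deflate(compressed);
    deflater.end();

    // Decompression requires the *same* dictionary: this is why compressed
    // blocks could only be bulk-copied on merge when source and destination
    // blocks were written against an identical dictionary.
    Inflater inflater = new Inflater(true);
    inflater.setDictionary(dict); // with raw deflate, set before inflating
    inflater.setInput(compressed, 0, clen + 1); // +1: dummy byte required by raw inflate
    byte[] restored = new byte[doc.length];
    int rlen = inflater.inflate(restored);
    inflater.end();

    System.out.println(new String(restored, 0, rlen, StandardCharsets.UTF_8));
  }
}
{code}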

> Make BEST_COMPRESSION compress more aggressively?
> -------------------------------------------------
>
>                 Key: LUCENE-9447
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9447
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> The Lucene86 codec supports setting a "Mode" for stored fields compression: 
> either "BEST_SPEED", which translates to blocks of 16kB or 128 documents 
> (whichever is hit first) compressed with LZ4, or "BEST_COMPRESSION", which 
> translates to blocks of 60kB or 512 documents compressed with DEFLATE at the 
> default compression level (6).
> After recently looking at indices that spend most of their disk space on 
> stored fields, I noticed that there was quite some room for improvement by 
> increasing the block size even further:
> ||Block size||Stored fields size (bytes)||
> |60kB|168412338|
> |128kB|130813639|
> |256kB|113587009|
> |512kB|104776378|
> |1MB|100367095|
> |2MB|98152464|
> |4MB|97034425|
> |8MB|96478746|
> For this specific dataset, I had 1M documents that each had about 2kB of 
> stored fields and quite some redundancy.
> This makes me want to look into bumping this block size to maybe 256kB. It 
> would be interesting to redo the experiments we did on LUCENE-6100 to see 
> how this affects merging speed. That said, I don't think it would be 
> terrible if the merging time increased a bit, given that we already offer 
> the BEST_SPEED option for CPU-savvy users.
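
For context, here is a minimal sketch of how a user opts into the mode setting 
described in the issue above, assuming the Lucene 8.6 API; the index path and 
analyzer are just placeholders.

{code:java}
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat;
import org.apache.lucene.codecs.lucene86.Lucene86Codec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BestCompressionIndexer {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    // BEST_COMPRESSION: DEFLATE over 60kB / 512-document blocks.
    // The default, BEST_SPEED, uses LZ4 over 16kB / 128-document blocks.
    config.setCodec(new Lucene86Codec(Lucene50StoredFieldsFormat.Mode.BEST_COMPRESSION));
    try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/index")), config)) {
      // index documents as usual; stored fields are compressed per the mode above
      writer.commit();
    }
  }
}
{code}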


