[ 
https://issues.apache.org/jira/browse/LUCENE-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179025#comment-17179025
 ] 

Adrien Grand commented on LUCENE-9447:
--------------------------------------

I've been playing more with the idea of preset dictionaries and found some 
combinations that provide interesting results. In order to preserve bulk 
merging I'm still splitting data into blocks the same way as today, but each 
block is further divided into sub-blocks: the first one is compressed on its 
own and then serves as a preset dictionary for all the other sub-blocks.
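For reference, the core trick can be sketched with plain java.util.zip. This is only an illustration of the idea, not the actual patch; the helper names and the synthetic data are made up for the demo:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictSketch {

    // Compress one sub-block with DEFLATE, optionally priming the
    // compressor with a preset dictionary (the first sub-block).
    static byte[] compress(byte[] data, byte[] dict) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        if (dict != null) {
            deflater.setDictionary(dict); // must be set before the first deflate() call
        }
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a sub-block, supplying the preset dictionary when the
    // stream asks for it (zlib sets the FDICT flag in that case).
    static byte[] decompress(byte[] compressed, byte[] dict, int originalLen) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLen];
        int off = 0;
        try {
            while (off < originalLen) {
                int n = inflater.inflate(out, off, originalLen - off);
                if (n > 0) {
                    off += n;
                } else if (inflater.needsDictionary()) {
                    inflater.setDictionary(dict); // supply the preset dictionary
                } else {
                    break; // finished (or truncated input)
                }
            }
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        } finally {
            inflater.end();
        }
        return out;
    }

    public static void main(String[] args) {
        // Fake stored-field bytes: sub-blocks share vocabulary but are not identical.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("doc:").append(i).append(" title:lucene body:stored fields compression ");
        }
        byte[] all = sb.toString().getBytes();
        byte[] dict = Arrays.copyOf(all, 8 * 1024); // 8kB preset dictionary
        byte[] sub = Arrays.copyOfRange(all, 8 * 1024, all.length);
        byte[] standalone = compress(sub, null);
        byte[] primed = compress(sub, dict);
        System.out.println("standalone=" + standalone.length + "B, with 8kB dict=" + primed.length + "B");
        // Round-trip sanity check.
        System.out.println(Arrays.equals(decompress(primed, dict, sub.length), sub));
    }
}
```

The dictionary only pays off when the sub-blocks actually share content, which is exactly the redundancy stored fields tend to have across nearby documents.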

On the same dataset I've been using in previous comments:

||Block size||Size of first sub-block||Size of other sub-blocks||Index size (MB)||Index time (s)||Avg fetch time (us)||
|256kB|8kB|48kB|70.1|14.5|36|
|1MB|8kB|48kB|62.5|14|42|

One benefit of this approach is that increasing the overall block size no 
longer hurts fetch times much, because we can usually skip decompressing 
most sub-blocks. It also improves compression ratios, since all sub-blocks 
but the first one start with a reasonable preset dictionary. As Robert 
noted, setting a preset dictionary is costly, so I'm using an 8kB preset 
dictionary, which seems to be a good trade-off between index-time overhead 
and compression ratios.
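To make the skipping concrete, here is a rough sketch of what the read path could look like: only the first (dictionary) sub-block and the target sub-block ever get decompressed, everything in between is skipped. Again, these helpers and sizes are invented for illustration and are not the patch's actual code:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class SubBlockFetchSketch {

    static final int DICT_SIZE = 8 * 1024;  // first sub-block = preset dictionary
    static final int SUB_SIZE = 48 * 1024;  // size of the other sub-blocks

    static byte[] deflate(byte[] data, byte[] dict) {
        Deflater d = new Deflater();
        if (dict != null) d.setDictionary(dict);
        d.setInput(data);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        d.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] compressed, byte[] dict, int originalLen) {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] out = new byte[originalLen];
        int off = 0;
        try {
            while (off < originalLen) {
                int n = inf.inflate(out, off, originalLen - off);
                if (n > 0) off += n;
                else if (inf.needsDictionary()) inf.setDictionary(dict);
                else break; // finished (or truncated input)
            }
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        } finally {
            inf.end();
        }
        return out;
    }

    // Compress a block (assumed >= DICT_SIZE here): the first DICT_SIZE bytes
    // stand alone, every later sub-block is compressed against them.
    static List<byte[]> compressBlock(byte[] block) {
        byte[] dict = Arrays.copyOf(block, DICT_SIZE);
        List<byte[]> subBlocks = new ArrayList<>();
        subBlocks.add(deflate(dict, null));
        for (int start = DICT_SIZE; start < block.length; start += SUB_SIZE) {
            int end = Math.min(start + SUB_SIZE, block.length);
            subBlocks.add(deflate(Arrays.copyOfRange(block, start, end), dict));
        }
        return subBlocks;
    }

    // Fetch one sub-block while skipping all the others: only the dictionary
    // sub-block and the target are ever decompressed.
    static byte[] fetchSubBlock(List<byte[]> subBlocks, int index, int targetLen) {
        byte[] dict = inflate(subBlocks.get(0), null, DICT_SIZE);
        if (index == 0) return dict;
        return inflate(subBlocks.get(index), dict, targetLen);
    }

    public static void main(String[] args) {
        byte[] block = new byte[DICT_SIZE + 4 * SUB_SIZE];
        byte[] pattern = "stored fields sharing vocabulary across sub-blocks ".getBytes();
        for (int i = 0; i < block.length; i++) block[i] = pattern[i % pattern.length];
        List<byte[]> subBlocks = compressBlock(block);
        // Fetch sub-block 3 without touching sub-blocks 1, 2 or 4.
        byte[] sub3 = fetchSubBlock(subBlocks, 3, SUB_SIZE);
        int start = DICT_SIZE + 2 * SUB_SIZE; // sub-block 3 covers [start, start + SUB_SIZE)
        System.out.println(Arrays.equals(sub3, Arrays.copyOfRange(block, start, start + SUB_SIZE)));
    }
}
```

With a 1MB block this means decompressing roughly 8kB + 48kB per fetch instead of up to 1MB, which is why fetch times stay reasonable in the table above.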

This approach is generally better than both the current one (DEFLATE with 
60kB blocks) and the new one I was considering (DEFLATE with 256kB blocks), 
which is promising, so I'll look into cleaning up the very hacky patch I 
wrote to test this.

> Make BEST_COMPRESSION compress more aggressively?
> -------------------------------------------------
>
>                 Key: LUCENE-9447
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9447
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Lucene86 codec supports setting a "Mode" for stored fields compression, 
> that is either "BEST_SPEED", which translates to blocks of 16kB or 128 
> documents (whichever is hit first) compressed with LZ4, or 
> "BEST_COMPRESSION", which translates to blocks of 60kB or 512 documents 
> compressed with DEFLATE with default compression level (6).
> After looking at indices that spent most disk space on stored fields 
> recently, I noticed that there was quite some room for improvement by 
> increasing the block size even further:
> ||Block size||Stored fields size||
> |60kB|168412338|
> |128kB|130813639|
> |256kB|113587009|
> |512kB|104776378|
> |1MB|100367095|
> |2MB|98152464|
> |4MB|97034425|
> |8MB|96478746|
> For this specific dataset, I had 1M documents that each had about 2kB of 
> stored fields and quite some redundancy.
> This makes me want to look into bumping this block size to maybe 256kB. It 
> would be interesting to re-do the experiments we did on LUCENE-6100 to see 
> how this affects the merging speed. That said I don't think it would be 
> terrible if the merging time increased a bit given that we already offer the 
> BEST_SPEED option for CPU-savvy users.
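The block-size effect in the table above is easy to reproduce outside Lucene with plain java.util.zip at the default DEFLATE level (6). A minimal sketch on synthetic redundant data, so the absolute numbers will not match the table:

```java
import java.util.zip.Deflater;

public class BlockSizeSketch {

    // DEFLATE `data` in independent blocks of `blockSize` bytes at the
    // default level (6) and return the total compressed size.
    static int compressedSize(byte[] data, int blockSize) {
        int total = 0;
        byte[] buf = new byte[8192];
        for (int start = 0; start < data.length; start += blockSize) {
            int len = Math.min(blockSize, data.length - start);
            Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
            deflater.setInput(data, start, len);
            deflater.finish();
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            deflater.end();
        }
        return total;
    }

    public static void main(String[] args) {
        // Pseudo-documents with redundancy that only larger blocks can exploit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("{\"id\":").append(i)
              .append(",\"body\":\"some fairly repetitive stored field content\"}\n");
        }
        byte[] data = sb.toString().getBytes();
        for (int blockSize : new int[] {60 * 1024, 256 * 1024, 1024 * 1024}) {
            System.out.println(blockSize / 1024 + "kB blocks -> "
                + compressedSize(data, blockSize) + " bytes");
        }
    }
}
```

Each independent block pays a fresh stream header and starts with a cold window, so on redundant data the total shrinks as blocks grow, with diminishing returns, matching the trend in the table.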



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
