[ https://issues.apache.org/jira/browse/LUCENE-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176951#comment-17176951 ]

Adrien Grand commented on LUCENE-9447:
--------------------------------------

Indeed, larger blocks make retrieval slower. I left the idea of using preset 
dictionaries out for now and did more tests. In particular I played with the 
idea of compressing with DEFLATE on top of LZ4, or LZ4 on top of LZ4. Because 
LZ4 only replaces duplicate strings with references, a first compression pass 
brings the remaining strings closer together, which sometimes moves duplicates 
to within the window size of the second pass (32kB for DEFLATE, 64kB for LZ4). 
This blog post talks a bit more about this idea, and here are data points on 
the same dataset as before. LZ4H denotes the high-compression mode of LZ4, 
which does more work to find longer duplicate strings.
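The window-size limit is easy to reproduce with zlib from the Python standard library (used here as an illustrative stand-in, since LZ4 is not in the stdlib): the same duplicate 1kB block sits either beyond or within DEFLATE's 32kB window, and only the near copy can be replaced by a back-reference.

```python
import random
import zlib

random.seed(0)
block = random.randbytes(1024)        # 1 kB of incompressible data
filler = random.randbytes(40 * 1024)  # 40 kB of incompressible filler

# Duplicates 40 kB apart: the second copy is beyond DEFLATE's 32 kB window.
far = block + filler + block
# Duplicates adjacent: the second copy is well within the window.
near = block + block + filler

c_far = len(zlib.compress(far, 9))
c_near = len(zlib.compress(near, 9))
print(c_far, c_near)  # the "near" layout saves roughly the size of one block
```

Moving a duplicate inside the window saves about one block's worth of output, which is exactly the effect a first LZ4 pass can have on the input of the second pass.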

 
||Method||Index size (MB)||Index time (s)||Avg fetch time (µs)||
|LZ4(16kB) (BEST_SPEED)|304.2|9|5|
|LZ4(60kB)|141.7|7.5|10|
|LZ4H+LZ4(60kB)|120.1|16.5|9|
|LZ4H(60kB)|120.1|15|8|
|LZ4H+LZ4+HUFFMAN_ONLY(60kB)|105.8|19|25|
|LZ4H+HUFFMAN_ONLY(60kB)|105.7|16.5|23|
|LZ4(256kB)|105.1|7.5|33|
|LZ4H+DEFLATE(60kB)|102.7|17.5|26|
|DEFLATE(60kB) (BEST_COMPRESSION)|100.6|14|35|
|LZ4(1MB)|96.5|7.5|115|
|LZ4H(256kB)|68.4|14.5|22|
|LZ4H+LZ4(256kB)|64.6|15|29|
|DEFLATE(256kB)|63.8|16.5|110|
|LZ4H+HUFFMAN_ONLY(256kB)|59.1|15|54|
|LZ4H+LZ4+HUFFMAN_ONLY(256kB)|57.7|15.5|58|
|LZ4H(1MB)|56.1|16.5|76|
|DEFLATE(1MB)|54.7|16|411|
|LZ4H+DEFLATE(256kB)|54.5|15.5|57|
|LZ4H+LZ4(1MB)|49.4|17|117|
|LZ4H+HUFFMAN_ONLY(1MB)|47.9|17.5|170|
|LZ4H+LZ4+HUFFMAN_ONLY(1MB)|44.6|18|194|
|LZ4H+DEFLATE(1MB)|40.8|18.5|172|


Unfortunately, I get very different numbers for enwiki documents, which are 
less redundant; there, Huffman coding contributes a large share of DEFLATE's 
compression ratio, which makes DEFLATE alone unbeatable.
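To get a feel for how much the Huffman stage contributes on its own, one can compare zlib's default strategy against Z_HUFFMAN_ONLY (which disables string matching) in Python. This is an illustrative stand-in, not the benchmark setup above: on repetitive data the LZ77 matching stage dominates, while Huffman coding alone only exploits the skewed byte frequencies of English text.

```python
import zlib

# A sentence of English repeated many times: string matching removes the
# repeats, while Huffman-only coding cannot, and shrinks the data only by
# entropy-coding the skewed byte frequencies.
sentence = (b"Stored fields compression trades CPU for disk space; the right "
            b"balance depends on how redundant the documents are. ")
data = sentence * 50

def deflate(data, strategy):
    c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 9, strategy)
    return c.compress(data) + c.flush()

full = len(deflate(data, zlib.Z_DEFAULT_STRATEGY))
huff = len(deflate(data, zlib.Z_HUFFMAN_ONLY))
print(len(data), full, huff)
```

On less-redundant text like enwiki the matching stage finds fewer duplicates, so the Huffman stage ends up responsible for much of DEFLATE's advantage over LZ4-based schemes.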


||Method||Index size (MB)||Index time (s)||Avg fetch time (µs)||
|LZ4(16kB) (BEST_SPEED)|558.8|14.5|83|
|LZ4(60kB)|526.2|15|106|
|LZ4(256kB)|523.1|15|323|
|LZ4(1MB)|521.3|15.5|1151|
|LZ4H+LZ4(60kB)|425.4|37|115|
|LZ4H(60kB)|424.2|32|112|
|LZ4H+LZ4(256kB)|397.5|49|267|
|LZ4H(256kB)|396.4|43|267|
|LZ4H+LZ4(1MB)|390.9|64|875|
|LZ4H(1MB)|389.8|61|887|
|LZ4H+HUFFMAN_ONLY(60kB)|377.6|35|240|
|LZ4H+DEFLATE(60kB)|376.5|41|228|
|LZ4H+HUFFMAN_ONLY(256kB)|357.1|45|668|
|LZ4H+DEFLATE(256kB)|356.7|53|694|
|LZ4H+HUFFMAN_ONLY(1MB)|352.3|65|2350|
|LZ4H+DEFLATE(1MB)|352.1|73|2460|
|DEFLATE(60kB) (BEST_COMPRESSION)|338.1|34|237|
|DEFLATE(256kB)|328.5|37|732|
|DEFLATE(1MB)|326.3|39|2624|

Based on this data I think that Mike's suggestion to increase the block size to 
256kB is a safe trade-off. I wonder whether we should also increase the block 
size for CompressionMode.FAST to 64kB.
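As a rough illustration of why larger blocks help (assuming cross-document redundancy, as in the first dataset), here is a toy comparison with zlib: the same ~512kB stream of near-duplicate 2kB "documents" is compressed in independent 16kB blocks versus 256kB blocks. The sizes and block layout are made up for the demo; only the direction of the effect is the point.

```python
import random
import zlib

random.seed(0)
# 2 kB "documents" drawn from a small pool, so most documents duplicate an
# earlier one -- but only a large enough block can exploit that.
pool = [random.randbytes(2048) for _ in range(8)]
docs = b"".join(random.choice(pool) for _ in range(256))  # ~512 kB

def compress_in_blocks(data, block_size):
    # Each block is compressed independently, as stored-fields blocks are.
    return sum(len(zlib.compress(data[i:i + block_size], 9))
               for i in range(0, len(data), block_size))

small_blocks = compress_in_blocks(docs, 16 * 1024)
large_blocks = compress_in_blocks(docs, 256 * 1024)
print(small_blocks, large_blocks)  # larger blocks see more duplicates
```

Note that DEFLATE's 32kB window still limits how far apart duplicates can be matched within a block, which is consistent with the diminishing returns past 256kB in the tables above.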

> Make BEST_COMPRESSION compress more aggressively?
> -------------------------------------------------
>
>                 Key: LUCENE-9447
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9447
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> The Lucene86 codec supports setting a "Mode" for stored fields compression, 
> that is either "BEST_SPEED", which translates to blocks of 16kB or 128 
> documents (whichever is hit first) compressed with LZ4, or 
> "BEST_COMPRESSION", which translates to blocks of 60kB or 512 documents 
> compressed with DEFLATE with default compression level (6).
> After looking at indices that spent most disk space on stored fields 
> recently, I noticed that there was quite some room for improvement by 
> increasing the block size even further:
> ||Block size||Stored fields size||
> |60kB|168412338|
> |128kB|130813639|
> |256kB|113587009|
> |512kB|104776378|
> |1MB|100367095|
> |2MB|98152464|
> |4MB|97034425|
> |8MB|96478746|
> For this specific dataset, I had 1M documents that each had about 2kB of 
> stored fields, with quite some redundancy.
> This makes me want to look into bumping this block size to maybe 256kB. It 
> would be interesting to re-do the experiments we did on LUCENE-6100 to see 
> how this affects the merging speed. That said I don't think it would be 
> terrible if the merging time increased a bit given that we already offer the 
> BEST_SPEED option for CPU-savvy users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
