The default should probably be LZ4. In our testing, LZ4 beat Snappy on
every dataset for read time, write time, and compression ratio. I believe
it also typically got a better compression ratio than gzip. Gzip was the
previous default because it achieves a better compression ratio than
Snappy.
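
For anyone who wants to compare codecs on their own data, here is a minimal
sketch of measuring write time, read time, and compression ratio. It is not
the benchmark described above: it uses only Python's standard library, so
gzip and zlib level 1 are stand-ins (no Snappy, no LZ4, no Parquet), and the
name `benchmark_codec` and the synthetic payload are assumptions made for
illustration.

```python
import gzip
import os
import tempfile
import time
import zlib


def benchmark_codec(name, compress, decompress, payload):
    """Time compress+write and read+decompress for one codec."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name + ".bin")

        # Write path: compress in memory, then write to local disk.
        start = time.perf_counter()
        with open(path, "wb") as f:
            f.write(compress(payload))
        write_s = time.perf_counter() - start

        # Read path: read the file back and decompress it.
        start = time.perf_counter()
        with open(path, "rb") as f:
            restored = decompress(f.read())
        read_s = time.perf_counter() - start

        size = os.path.getsize(path)

    if restored != payload:
        raise ValueError(f"{name}: round trip did not reproduce the input")
    return {
        "codec": name,
        "write_s": write_s,
        "read_s": read_s,
        "ratio": len(payload) / size,  # uncompressed bytes / compressed bytes
    }


# Stand-in codecs from the standard library (assumption: swap in real
# Snappy or LZ4 bindings here if they are installed).
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "zlib-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
}

# Synthetic, highly repetitive payload; real datasets will behave differently.
payload = b"iceberg,parquet,row-group,64mb\n" * 50_000

results = [benchmark_codec(n, c, d, payload) for n, (c, d) in CODECS.items()]
for r in results:
    print(f"{r['codec']}: write={r['write_s']:.3f}s "
          f"read={r['read_s']:.3f}s ratio={r['ratio']:.1f}")
```

Timings from a single run on a synthetic payload are noisy, so repeat the
measurement on representative data before drawing conclusions.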

Ryan

On Thu, Jul 1, 2021 at 1:48 PM Sreeram Garlapati <[email protected]>
wrote:

> Hello Iceberg devs!
>
> Do any of you folks use *Parquet + Snappy* as the underlying file format?
> Iceberg configures this by default as Parquet + gzip (
> *write.parquet.compression-codec*).
> *Is there any specific reason for this choice?*
>
> In our preliminary tests we found better numbers with *Parquet + Snappy*
> than with *gzip*.
> Operation = compress and write to local disk
> File size = 524.3 MB (about the same with both compression codecs)
> Row group size = 64 MB
>
> gzip     snappy
> 8.304    5.478
>
>
> We are still in the process of our full benchmarking (for reads), but we
> want to understand whether there is a whole different angle to this that
> we are not thinking through.
>
> Truly appreciate any inputs,
> Sreeram
>


-- 
Ryan Blue
Tabular
