Re: Parquet with - Snappy vs gzip

Ryan Blue Thu, 01 Jul 2021 14:01:51 -0700

You should probably try Zstd while you're at it. We had great results with
Zstd as well. My conclusion was that Zstd is probably the right choice when
you want higher compression ratios, and LZ4 is the right choice when you
don't need great compression but want fast compression and decompression
speeds. Zstd pretty much replaces gzip, and LZ4 replaces snappy.
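
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a harness that records round-trip time and compression ratio per codec. It is not the benchmark used in this thread. Python's standard library ships none of snappy, LZ4, or Zstd, so zlib (the DEFLATE algorithm behind gzip) at two levels and lzma act as stand-in codecs. The names `benchmark`, `Result`, and `CODECS` are hypothetical, and the sample data is synthetic.

```python
import lzma
import time
import zlib
from typing import Callable, Dict, NamedTuple, Tuple

# A codec is a (compress, decompress) pair of bytes -> bytes functions.
Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]


class Result(NamedTuple):
    seconds: float  # wall-clock time for one compress + decompress round trip
    ratio: float    # original size divided by compressed size


def benchmark(data: bytes, codecs: Dict[str, Codec]) -> Dict[str, Result]:
    """Time one round trip per codec and record the compression ratio."""
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        unpacked = decompress(packed)
        elapsed = time.perf_counter() - start
        # Guard against a codec that silently corrupts the payload.
        assert unpacked == data, f"{name} did not round-trip"
        results[name] = Result(elapsed, len(data) / len(packed))
    return results


# Stand-in codecs (assumption): swap in snappy, LZ4, or Zstd bindings
# here if they are installed; only the two callables need to change.
CODECS: Dict[str, Codec] = {
    "zlib-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
    "zlib-9": (lambda b: zlib.compress(b, 9), zlib.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

if __name__ == "__main__":
    # Synthetic, repetitive rows; real table data will behave differently.
    sample = b"".join(b"row-%d,category-%d\n" % (i, i % 7) for i in range(50_000))
    for name, r in benchmark(sample, CODECS).items():
        print(f"{name:8s} {r.seconds:8.4f}s  ratio {r.ratio:6.2f}")
```

A single in-memory round trip is only a rough proxy. Numbers for a real table also depend on the Parquet encodings, the row group size, and the storage you write to, so it is worth repeating the measurement on your own datasets.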


On Thu, Jul 1, 2021 at 1:59 PM Sreeram Garlapati <[email protected]>
wrote:

> Slick, thanks @Ryan Blue <[email protected]>. We will add LZ4 to our mix
> and report back if we find anything different.
>
> On Thu, Jul 1, 2021 at 1:50 PM Ryan Blue <[email protected]> wrote:
>
>> The default should probably be LZ4. In our testing, LZ4 beat snappy for
>> every dataset for read time, write time, and compression ratio. I believe
>> it also typically got a better compression ratio than gzip. Gzip was the
>> previous default because it does a better job on compression ratio than
>> snappy.
>>
>> Ryan
>>
>> On Thu, Jul 1, 2021 at 1:48 PM Sreeram Garlapati <[email protected]>
>> wrote:
>>
>>> Hello Iceberg devs!
>>>
>>> Do any of you folks use *Parquet + Snappy* as the underlying file
>>> format?
>>> Iceberg configures this by default as Parquet + gzip
>>> (*write.parquet.compression-codec*).
>>> *Is there any specific reason for this choice?*
>>>
>>> In our preliminary tests we found better numbers with *Parquet + Snappy*
>>> than with *gzip*.
>>> Operation = compress and write to local disk
>>> File size = 524.3 MB (about the same with both compression codecs)
>>> Row group size = 64 MB
>>>
>>> gzip     snappy
>>> 8.304    5.478
>>>
>>>
>>> We are still running our full benchmarks (including reads), but we
>>> want to understand whether there is a whole different angle to this
>>> that we are not thinking through.
>>>
>>> Truly appreciate any inputs,
>>> Sreeram
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular
