The default should probably be LZ4. In our testing, LZ4 beat Snappy on every dataset in read time, write time, and compression ratio. I believe it also typically got a better compression ratio than gzip. Gzip was the previous default because it achieves a better compression ratio than Snappy.
Ryan

On Thu, Jul 1, 2021 at 1:48 PM Sreeram Garlapati <[email protected]> wrote:

> Hello Iceberg devs!
>
> Do any of you folks use the underlying file format as *Parquet + Snappy.*
> Iceberg configures this by default as Parquet + gzip
> (*write.parquet.compression-codec*).
> *Is there any specific reason for this Choice?*
>
> In our preliminary tests we found better numbers with *Parquet + Snappy*
> than with *gzip*.
> Operation = compress and write to local disk
> File Size = 524.3MB (about the same with both the compression codecs)
> row group size = 64mb.
>
> gzip     snappy
> 8.304    5.478
>
> We are still in the process of our full benchmarking (for reads) - but,
> want to understand - if there is a whole different angle to this that we
> are not thinking thru.
>
> Truly appreciate any inputs,
> Sreeram

--
Ryan Blue
Tabular
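
For anyone who wants to reproduce a "compress and write to local disk" measurement like the one quoted above, here is a minimal sketch of a timing harness. It is not the benchmark either of us ran. The `benchmark` helper and the synthetic data are hypothetical, and it uses only Python's standard-library `zlib` (DEFLATE, the algorithm behind gzip) at different levels. Snappy and LZ4 are not in the standard library, so comparing them would mean passing in a compress function from a third-party binding, and raw-buffer numbers will not match what Parquet does per column chunk.

```python
import os
import tempfile
import time
import zlib


def benchmark(name, compress, data):
    """Compress `data`, write it to a temp file, and report time and ratio."""
    start = time.perf_counter()
    compressed = compress(data)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(compressed)
        path = f.name
    # The file is closed here, so the timing covers both compress and write.
    elapsed = time.perf_counter() - start
    size = os.path.getsize(path)
    os.remove(path)
    return {
        "codec": name,
        "seconds": elapsed,
        "bytes": size,
        "ratio": len(data) / size,
    }


# Synthetic, repetitive sample data (an assumption; real tables behave differently).
data = b"".join(b"row-%d,value-%d\n" % (i, i % 100) for i in range(200_000))

# Hypothetical codec table: each entry maps a label to a bytes -> bytes function.
# A Snappy or LZ4 binding's compress function could be added the same way.
codecs = {
    "deflate-1": lambda d: zlib.compress(d, 1),
    "deflate-6": lambda d: zlib.compress(d, 6),
    "deflate-9": lambda d: zlib.compress(d, 9),
}

results = [benchmark(name, fn, data) for name, fn in codecs.items()]
for r in results:
    print(
        f"{r['codec']:<10} {r['seconds']:.3f}s "
        f"{r['bytes']:>9} bytes  ratio {r['ratio']:.2f}"
    )
```

Running each codec several times and on representative data, rather than once on a synthetic buffer, would give numbers that are more meaningful for choosing a table default.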
