jinyangli34 commented on PR #11258: URL: https://github.com/apache/iceberg/pull/11258#issuecomment-2430386830
> > This makes it difficult to estimate the current row group size, and result in creating much smaller row-group than `write.parquet.row-group-size-bytes` config
>
> @jinyangli34 is this because your assumption is that `write.parquet.row-group-size-bytes` refers to the compressed row group size?

@nastra yes, see the discussion here: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1727984247713889

Per the Parquet docs, the row group size maps to the HDFS block size, so it should be the on-disk (compressed) size:

```
Row Group Size

Larger row groups allow for larger column chunks which makes it possible to do
larger sequential IO. Larger groups also require more buffering in the write
path (or a two pass write). We recommend large row groups (512MB - 1GB). Since
an entire row group might need to be read, we want it to completely fit on one
HDFS block. Therefore, HDFS block sizes should also be set to be larger. An
optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS
block per HDFS file
```
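For context, here is a minimal sketch of setting the property in question through the Iceberg Java API. The catalog setup, table identifier, and 512 MB target are illustrative assumptions, not values from this PR:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

public class RowGroupSizeExample {
  // Sketch: raise the row-group target toward the Parquet docs'
  // recommended range. Assumes `catalog` is an already-configured
  // Catalog and "db.events" is a hypothetical table.
  static void setRowGroupTarget(Catalog catalog) {
    Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
    table.updateProperties()
        // TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES is the constant
        // behind "write.parquet.row-group-size-bytes"
        .set(TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
            String.valueOf(512L * 1024 * 1024)) // 512 MB target
        .commit();
  }
}
```

Whether that target is then interpreted as the compressed on-disk size or the in-memory buffered size is exactly the ambiguity this thread is about.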