jinyangli34 commented on PR #11258: URL: https://github.com/apache/iceberg/pull/11258#issuecomment-2430386830
> > This makes it difficult to estimate the current row group size, and result in creating much smaller row-group than `write.parquet.row-group-size-bytes` config
>
> @jinyangli34 is this because your assumption is that `write.parquet.row-group-size-bytes` refers to the compressed row group size?

@nastra yes, see the discussion here: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1727984247713889

Per the Parquet docs, the row group size maps to the HDFS block size, so it should be the on-disk (compressed) size:

```
Row Group Size

Larger row groups allow for larger column chunks which makes it possible to do
larger sequential IO. Larger groups also require more buffering in the write
path (or a two pass write). We recommend large row groups (512MB - 1GB). Since
an entire row group might need to be read, we want it to completely fit on one
HDFS block. Therefore, HDFS block sizes should also be set to be larger. An
optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS
block per HDFS file
```
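For context, here is a minimal sketch of setting the property in question through the Iceberg Java API. The catalog setup, table identifier, and 512 MB target are illustrative assumptions, not values from this PR:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

public class RowGroupSizeExample {
  // Sketch: raise the row-group target toward the Parquet docs'
  // recommended range. Assumes `catalog` is an already-configured
  // Catalog and "db.events" is a hypothetical table.
  static void setRowGroupTarget(Catalog catalog) {
    Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
    table.updateProperties()
        // TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES is the constant
        // behind "write.parquet.row-group-size-bytes"
        .set(TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
            String.valueOf(512L * 1024 * 1024)) // 512 MB target
        .commit();
  }
}
```

Whether that target is then interpreted as the compressed on-disk size or the in-memory buffered size is exactly the ambiguity this thread is about.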