amogh-jahagirdar commented on issue #8729: URL: https://github.com/apache/iceberg/issues/8729#issuecomment-1750041439

The issue title and parts of the description refer to `write.parquet.page-size-bytes`, but what you are describing is the file size, and your code also refers to file size. Did you perhaps mean `write.target-file-size-bytes` instead? I'm assuming that's the case since your question is about file sizes.

Check out https://iceberg.apache.org/docs/latest/spark-writes/#controlling-file-sizes and https://iceberg.apache.org/docs/latest/spark-writes/#writing-distribution-modes

The file size is bounded by the Spark task size: if the task size exceeds `write.target-file-size-bytes`, the writer rolls over to a new file. However, if the task size is smaller, there is no rollover, so the file can end up smaller than the target. And because Parquet is highly compressible, the file written to disk will be smaller still.

There is also a `write.distribution-mode` table property, which controls how the data is distributed across the Spark tasks performing the writes. Prior to 1.2.0 the default was `none`, which required explicit ordering by partition; for tables created with 1.2.0 or later the default is `hash`, which hash-shuffles the data prior to writing. This change was made to alleviate the small-files problem, so the Iceberg version you are using would also be helpful information.

@RussellSpitzer @aokolnychyi would also have more expertise in this area, so please correct me if I'm wrong about anything!
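For reference, a minimal PySpark sketch of setting these two table properties; the table name `demo.db.events` and the 512 MB target are placeholders for illustration, not values taken from this issue:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-write-tuning").getOrCreate()

# Target ~512 MB data files: the writer rolls over to a new file once a task
# has written this many bytes. Also use hash distribution so rows for the
# same partition are shuffled to the same task, reducing small files.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.target-file-size-bytes' = '536870912',
        'write.distribution-mode'      = 'hash'
    )
""")
```

Keep in mind that the target is checked against in-memory row group sizes as data is written, so the final on-disk Parquet files are typically smaller than the configured target after compression.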
The issue title and parts of the description refer to `write.parquet.page-size-bytes` but what you are describing is the file size and in your code also you refer to file size, did you perhaps mean `write.target-file-size-bytes` instead? I'm assuming that's the case based on your question being around the file size. Check out https://iceberg.apache.org/docs/latest/spark-writes/#controlling-file-sizes and https://iceberg.apache.org/docs/latest/spark-writes/#writing-distribution-modes The file size will be bounded by the Spark task size; if the task size exceeds the write.target-file-size-bytes, the writer will roll over to a new file. However, if the task size is smaller there's no "roll over". When this gets written to disk, since Parquet is highly compressible it'll be even smaller. There's a write.distribution-mode table property which how to distribute the data across spark tasks performing the writes. Prior to 1.2.0 this was `none`, which required explicit ordering by partition; for tables created after 1.2.0 this is `hash` which shuffles the data via hash prior to writing. This change was done to alleviate the small files problem, so the Iceberg version you are using will also be helpful info. @RussellSpitzer @aokolnychyi would also have more expertise in this area, so please correct me if I'm wrong about anything! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org