amogh-jahagirdar commented on issue #8729: URL: https://github.com/apache/iceberg/issues/8729#issuecomment-1750041439

The issue title and parts of the description refer to `write.parquet.page-size-bytes`, but what you are describing is the file size, and your code also refers to file size. Did you perhaps mean `write.target-file-size-bytes` instead? I'm assuming that's the case since your question is about file sizes.

Check out https://iceberg.apache.org/docs/latest/spark-writes/#controlling-file-sizes and https://iceberg.apache.org/docs/latest/spark-writes/#writing-distribution-modes

The file size is bounded by the Spark task size: if the task size exceeds `write.target-file-size-bytes`, the writer rolls over to a new file. However, if the task size is smaller, there is no rollover, so the file can end up smaller than the target. And because Parquet is highly compressible, the file written to disk will be smaller still.

There is also a `write.distribution-mode` table property, which controls how the data is distributed across the Spark tasks performing the writes. Prior to 1.2.0 the default was `none`, which required explicit ordering by partition; for tables created with 1.2.0 or later the default is `hash`, which hash-shuffles the data prior to writing. This change was made to alleviate the small-files problem, so the Iceberg version you are using would also be helpful information.

@RussellSpitzer @aokolnychyi would also have more expertise in this area, so please correct me if I'm wrong about anything!
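For reference, a minimal PySpark sketch of setting these two table properties; the table name `demo.db.events` and the 512 MB target are placeholders for illustration, not values taken from this issue:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-write-tuning").getOrCreate()

# Target ~512 MB data files: the writer rolls over to a new file once a task
# has written this many bytes. Also use hash distribution so rows for the
# same partition are shuffled to the same task, reducing small files.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.target-file-size-bytes' = '536870912',
        'write.distribution-mode'      = 'hash'
    )
""")
```

Keep in mind that the target is checked against in-memory row group sizes as data is written, so the final on-disk Parquet files are typically smaller than the configured target after compression.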
The issue title and parts of the description refer to `write.parquet.page-size-bytes` but what you are describing is the file size and in your code also you refer to file size, did you perhaps mean `write.target-file-size-bytes` instead? I'm assuming that's the case based on your question being around the file size. Check out https://iceberg.apache.org/docs/latest/spark-writes/#controlling-file-sizes and https://iceberg.apache.org/docs/latest/spark-writes/#writing-distribution-modes The file size will be bounded by the Spark task size; if the task size exceeds the write.target-file-size-bytes, the writer will roll over to a new file. However, if the task size is smaller there's no "roll over". When this gets written to disk, since Parquet is highly compressible it'll be even smaller. There's a write.distribution-mode table property which how to distribute the data across spark tasks performing the writes. Prior to 1.2.0 this was `none`, which required explicit ordering by partition; for tables created after 1.2.0 this is `hash` which shuffles the data via hash prior to writing. This change was done to alleviate the small files problem, so the Iceberg version you are using will also be helpful info. @RussellSpitzer @aokolnychyi would also have more expertise in this area, so please correct me if I'm wrong about anything! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org