RussellSpitzer commented on issue #8729: URL: https://github.com/apache/iceberg/issues/8729#issuecomment-1751713732
Everything Amogh said is correct: the write target file size is the maximum a writer will produce, not the minimum. The amount of data written to a file depends on the amount of data in the Spark task. That, in turn, is controlled by the advisory partition size if you are using hash or range distribution; if you are not using any write distribution, it is simply the size of the Spark tasks.

As for your questions:

1. It's whatever the shuffle engine thinks the size of the Spark-serialized rows is.
2. Yes, BUT these only apply if a shuffle happens before the write, which only happens if the write distribution mode is hash or range. Iceberg has no additional coalescing rules.
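As a minimal sketch of the knobs discussed above (the catalog/table name `demo.db.events` and the byte values are hypothetical, not from the original report): the file-size cap and distribution mode are Iceberg table properties, while the advisory partition size is a Spark session setting that only takes effect when a pre-write shuffle occurs.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-write-sizing")
  // Advisory partition size only matters when a shuffle happens before the write,
  // i.e. when write.distribution-mode is hash or range; AQE then coalesces
  // shuffle partitions toward roughly this size.
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256m")
  .getOrCreate()

// write.target-file-size-bytes is an upper bound per data file; smaller files
// are still produced when a task simply has less data than the target.
spark.sql("""
  ALTER TABLE demo.db.events SET TBLPROPERTIES (
    'write.target-file-size-bytes' = '536870912',  -- 512 MB max per file
    'write.distribution-mode'      = 'hash'        -- shuffle by partition key before write
  )
""")
```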