aokolnychyi opened a new pull request, #7897: URL: https://github.com/apache/iceberg/pull/7897
This PR adds a new compaction option, `shuffle-partitions-per-file`, for shuffle-based file rewriters. By default, our shuffling file rewriters assume each shuffle partition becomes a separate output file. Generating large output files of 512 MB or more can strain the cluster, as such rewrites require a lot of Spark memory. This option further divides the data that will end up in a single file. For example, if the target file size is 2 GB but the cluster can only handle 512 MB shuffle partitions, this option can be set to 4. Iceberg will use a custom coalesce operation to stitch these sorted partitions back together into a single sorted file.
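
A minimal sketch of how the option might be passed through the Spark actions API, assuming the 2 GB / 512 MB example above (the catalog lookup and session setup are placeholders, not part of this PR):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class ShufflePartitionsPerFileExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("rewrite-example")
        .getOrCreate();

    // Placeholder: load the Iceberg table however your catalog is configured.
    Table table = loadTable();

    // Target ~2 GB output files, but keep each shuffle partition around
    // 512 MB by splitting every output file across 4 shuffle partitions.
    RewriteDataFiles.Result result =
        SparkActions.get(spark)
            .rewriteDataFiles(table)
            .sort() // shuffle-based rewriter; bin-packing does not shuffle
            .option("target-file-size-bytes", String.valueOf(2L * 1024 * 1024 * 1024))
            .option("shuffle-partitions-per-file", "4")
            .execute();

    System.out.println("Rewritten data files: " + result.rewrittenDataFilesCount());
  }

  private static Table loadTable() {
    throw new UnsupportedOperationException("load the Iceberg table from your catalog");
  }
}
```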
