aokolnychyi opened a new pull request, #7897:
URL: https://github.com/apache/iceberg/pull/7897

   This PR adds a new compaction option called `shuffle-partitions-per-file` 
for shuffle-based file rewriters.
   
   By default, our shuffle-based file rewriters assume each shuffle partition 
becomes a separate output file. Generating large output files (512 MB or more) 
can therefore strain the cluster, since such rewrites require a lot of Spark 
memory per task. This parameter further divides up the data that will end up 
in a single file. For example, if the target file size is 2 GB but the cluster 
can only handle shuffles of 512 MB, this parameter could be set to 4. Iceberg 
will use a custom coalesce operation to stitch these sorted partitions back 
together into a single sorted file.
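   The sizing arithmetic above can be sketched as follows. This is a minimal 
illustration of how a user might pick a value for the option; the constants 
and variable names are illustrative, not part of the PR:

```python
import math

# Illustrative numbers matching the example in the description.
TARGET_FILE_SIZE_BYTES = 2 * 1024**3          # desired output file size: 2 GB
MAX_SHUFFLE_PARTITION_BYTES = 512 * 1024**2   # largest shuffle partition the cluster handles: 512 MB

# Round up so every shuffle partition stays within the memory budget.
shuffle_partitions_per_file = math.ceil(
    TARGET_FILE_SIZE_BYTES / MAX_SHUFFLE_PARTITION_BYTES
)
print(shuffle_partitions_per_file)  # 4
```

   With this value, each 2 GB output file is produced from 4 sorted 512 MB 
shuffle partitions that Iceberg coalesces back together.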


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

