aokolnychyi commented on code in PR #7897:
URL: https://github.com/apache/iceberg/pull/7897#discussion_r1240903803
##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/SparkShufflingDataRewriter.java:
##########
@@ -59,7 +61,24 @@ abstract class SparkShufflingDataRewriter extends SparkSizeBasedDataRewriter {
public static final double COMPRESSION_FACTOR_DEFAULT = 1.0;
+ /**
Review Comment:
I tested the current implementation on a table with 1 TB of data, using a cluster of 16 GB executors with 7 cores each. The target file size was 1 GB (zstd Parquet data). Without this option, the sort-based rewrite spilled and failed; I lost all executors one by one. With 8 shuffle partitions per file, the operation succeeded without any failures and produced properly sized files.
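
For context, here is a minimal sketch of how the scenario above might be reproduced through the rewrite action API. It assumes the option key is `shuffle-partitions-per-file` (the option this PR introduces); the table identifier and Spark session setup are placeholders:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class RewriteWithShufflePartitionsPerFile {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Placeholder identifier; any catalog-qualified Iceberg table works here.
    Table table = Spark3Util.loadIcebergTable(spark, "db.events");

    RewriteDataFiles.Result result =
        SparkActions.get(spark)
            .rewriteDataFiles(table)
            .sort() // sort-based rewrite, the case described above
            // 1 GB target files, matching the scenario in the comment
            .option("target-file-size-bytes", String.valueOf(1024L * 1024 * 1024))
            // split each output file across 8 shuffle partitions to reduce spilling
            .option("shuffle-partitions-per-file", "8")
            .execute();

    System.out.println("Rewritten data files: " + result.rewrittenDataFilesCount());
  }
}
```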