szehon-ho commented on code in PR #7897:
URL: https://github.com/apache/iceberg/pull/7897#discussion_r1242727767


##########
spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java:
##########
@@ -225,6 +249,43 @@ public void testRewriteDataFilesWithZOrder() {
     assertEquals("Should have expected rows", expectedRows, sql("SELECT * FROM 
%s", tableName));
   }
 
+  @Test

Review Comment:
   This is nice, but can we also add a test that asserts the sort order is preserved within a partition? (e.g., use a small partition, and just assert that the file is in order)
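
   A rough sketch of what such a test could look like (hypothetical test name; this assumes the suite's existing createTable()/insertData()/sql() helpers, its List<Object[]>-aware assertEquals, the usual java.util imports, and that scanning a single small file returns rows in file order):

```java
@Test
public void testSortOrderPreservedWithinPartition() {
  createTable();  // assumed existing helper in this suite
  insertData(10); // assumed existing helper in this suite
  sql(
      "CALL %s.system.rewrite_data_files(table => '%s', strategy => 'sort', "
          + "sort_order => 'c1 ASC')",
      catalogName, tableIdent);

  // With one small file, scan order should match file order, so c1 must come back sorted
  List<Object[]> actual = sql("SELECT c1 FROM %s", tableName);
  List<Object[]> expected = new ArrayList<>(actual);
  expected.sort(Comparator.comparing((Object[] row) -> (Integer) row[0]));
  assertEquals("Rows should be sorted within the rewritten file", expected, actual);
}
```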



##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/SparkShufflingDataRewriter.java:
##########
@@ -59,7 +61,24 @@ abstract class SparkShufflingDataRewriter extends SparkSizeBasedDataRewriter {
 
   public static final double COMPRESSION_FACTOR_DEFAULT = 1.0;
 
+  /**
+   * The number of shuffle partitions to use for each output file. By default, this file rewriter
+   * assumes each shuffle partition would become a separate output file. Attempting to generate
+   * large output files of 512 MB and more may strain the memory resources of the cluster as such

Review Comment:
   and more => or higher



##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/SparkShufflingDataRewriter.java:
##########
@@ -59,7 +61,24 @@ abstract class SparkShufflingDataRewriter extends SparkSizeBasedDataRewriter {
 
   public static final double COMPRESSION_FACTOR_DEFAULT = 1.0;
 
+  /**
+   * The number of shuffle partitions to use for each output file. By default, this file rewriter
+   * assumes each shuffle partition would become a separate output file. Attempting to generate
+   * large output files of 512 MB and more may strain the memory resources of the cluster as such
+   * rewrites would require lots of Spark memory. This parameter can be used to further divide up
+   * the data which will end up in a single file. For example, if the target file size is 2 GB, but
+   * the cluster can only handle shuffles of 512 MB, this parameter could be set to 4. Iceberg will
+   * use a custom coalesce operation to stitch these sorted partitions back together into a single
+   * sorted file.
+   *
+   * <p>Note using this parameter requires enabling Iceberg Spark session extensions.
+   */
+  public static final String SHUFFLE_PARTITIONS_PER_FILE = "shuffle-partitions-per-file";

Review Comment:
   Not to block this change, but did we consider having shuffle-threshold?  Ie, 
if we have some partition with 2G but others that are way less than 512MB, no 
need to shuffle the ones that are less?
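
   A purely illustrative sketch of that idea (hypothetical "shuffle-threshold" option and helper, not part of this PR): only split the shuffle for output files larger than what the cluster can comfortably handle, and leave smaller ones at one shuffle partition per file.

```java
// Hypothetical: groups whose target output files fit under the threshold keep a
// single shuffle partition per file; larger ones are split the same way
// SHUFFLE_PARTITIONS_PER_FILE splits them today.
private static final long SHUFFLE_THRESHOLD_DEFAULT = 512L * 1024 * 1024; // 512 MB

private int shufflePartitionsPerFile(long targetFileSizeBytes, long shuffleThresholdBytes) {
  if (targetFileSizeBytes <= shuffleThresholdBytes) {
    return 1; // small enough, no extra shuffle splitting needed
  }
  // otherwise divide each output file into chunks no larger than the threshold
  return (int) Math.ceil((double) targetFileSizeBytes / shuffleThresholdBytes);
}
```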

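Separately, a hedged usage sketch for the option as documented in the Javadoc above (assumes a SparkSession `spark` with the Iceberg Spark session extensions enabled and a loaded `Table table`; the 4 comes from the Javadoc's 2 GB / 512 MB example):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;

// Target ~2 GB output files while shuffling only ~512 MB per partition:
SparkActions.get(spark)
    .rewriteDataFiles(table)
    .sort() // shuffle-partitions-per-file only applies to shuffling rewriters
    .option("target-file-size-bytes", String.valueOf(2L * 1024 * 1024 * 1024))
    .option("shuffle-partitions-per-file", "4")
    .execute();
```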


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
