szehon-ho commented on code in PR #7897:
URL: https://github.com/apache/iceberg/pull/7897#discussion_r1242881883
##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/SparkShufflingDataRewriter.java:
##########
@@ -59,7 +61,24 @@ abstract class SparkShufflingDataRewriter extends
SparkSizeBasedDataRewriter {
public static final double COMPRESSION_FACTOR_DEFAULT = 1.0;
+ /**
+ * The number of shuffle partitions to use for each output file. By default,
this file rewriter
+ * assumes each shuffle partition would become a separate output file.
Attempting to generate
+ * large output files of 512 MB and more may strain the memory resources of
the cluster as such
+ * rewrites would require lots of Spark memory. This parameter can be used
to further divide up
+ * the data which will end up in a single file. For example, if the target
file size is 2 GB, but
+ * the cluster can only handle shuffles of 512 MB, this parameter could be
set to 4. Iceberg will
+ * use a custom coalesce operation to stitch these sorted partitions back
together into a single
+ * sorted file.
+ *
+ * <p>Note using this parameter requires enabling Iceberg Spark session
extensions.
+ */
+ public static final String SHUFFLE_PARTITIONS_PER_FILE =
"shuffle-partitions-per-file";
Review Comment:
I was just wondering the use case, where we set shuffle-partitions-per-file
to 4, because we want 4GB files but can only shuffle 512mb. However, consider
an Iceberg partition (rewrite group) that has only 512MB files. Will we still
shuffle to four partitions in this case and coalesce at end, unnecessarily? I
may be missing something.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]