aokolnychyi commented on code in PR #7897:
URL: https://github.com/apache/iceberg/pull/7897#discussion_r1240903803
##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/SparkShufflingDataRewriter.java:
##########
@@ -59,7 +61,24 @@ abstract class SparkShufflingDataRewriter extends SparkSizeBasedDataRewriter {
public static final double COMPRESSION_FACTOR_DEFAULT = 1.0;
+ /**
Review Comment:
I tested the current implementation on a table with 1 TB of data, using a cluster of 16 GB executors with 7 cores each. The target file size was 1 GB (zstd Parquet data). Without this option, the sort-based rewrite spilled and failed; I lost all executors one by one. With 8 shuffle partitions per file, the operation succeeded without any failures and produced properly sized files.
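
For context, here is a minimal sketch of how the scenario above might be reproduced through the rewrite action API. It assumes the option key is `shuffle-partitions-per-file` (the option this PR introduces); the table identifier and Spark session setup are placeholders:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class RewriteWithShufflePartitionsPerFile {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Placeholder identifier; any catalog-qualified Iceberg table works here.
    Table table = Spark3Util.loadIcebergTable(spark, "db.events");

    RewriteDataFiles.Result result =
        SparkActions.get(spark)
            .rewriteDataFiles(table)
            .sort() // sort-based rewrite, the case described above
            // 1 GB target files, matching the scenario in the comment
            .option("target-file-size-bytes", String.valueOf(1024L * 1024 * 1024))
            // split each output file across 8 shuffle partitions to reduce spilling
            .option("shuffle-partitions-per-file", "8")
            .execute();

    System.out.println("Rewritten data files: " + result.rewrittenDataFilesCount());
  }
}
```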