aokolnychyi commented on code in PR #8660:
URL: https://github.com/apache/iceberg/pull/8660#discussion_r1338931458
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkWriteRequirements.java:
##########
@@ -20,20 +20,24 @@
import org.apache.spark.sql.connector.distributions.Distribution;
import org.apache.spark.sql.connector.distributions.Distributions;
+import org.apache.spark.sql.connector.distributions.UnspecifiedDistribution;
import org.apache.spark.sql.connector.expressions.SortOrder;
/** A set of requirements such as distribution and ordering reported to Spark during writes. */
public class SparkWriteRequirements {
public static final SparkWriteRequirements EMPTY =
-      new SparkWriteRequirements(Distributions.unspecified(), new SortOrder[0]);
+      new SparkWriteRequirements(Distributions.unspecified(), new SortOrder[0], 0);
Review Comment:
Yes, it matches the default value in `RequiresDistributionAndOrdering` and
means no preference.
```
/**
 * Returns the advisory (not guaranteed) shuffle partition size in bytes for this write.
 * <p>
 * Implementations may override this to indicate the preferable partition size in shuffles
 * performed to satisfy the requested distribution. Note that Spark doesn't support setting
 * the advisory partition size for {@link UnspecifiedDistribution}, the query will fail if
 * the advisory partition size is set but the distribution is unspecified. Data sources may
 * either request a particular number of partitions via {@link #requiredNumPartitions()} or
 * a preferred partition size, not both.
 * <p>
 * Data sources should be careful with large advisory sizes as it will impact the writing
 * parallelism and may degrade the overall job performance.
 * <p>
 * Note this value only acts like a guidance and Spark does not guarantee the actual and advisory
 * shuffle partition sizes will match. Ignored if the adaptive execution is disabled.
 *
 * @return the advisory partition size, any value less than 1 means no preference.
 */
default long advisoryPartitionSizeInBytes() { return 0; }
```
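For context, the contract quoted above can be sketched with a stripped-down stand-in for Spark's `RequiresDistributionAndOrdering` (the real interface lives in Spark's connector write API); the class and interface names below are illustrative only, not the actual Spark types:

```java
// Minimal, self-contained sketch of the default-method contract discussed
// above. This mimics Spark's RequiresDistributionAndOrdering default of 0
// ("no preference"); it is NOT the real interface.
public class AdvisorySizeSketch {

  interface RequiresDistributionAndOrderingLike {
    // Matches the quoted Javadoc: any value less than 1 means no preference.
    default long advisoryPartitionSizeInBytes() {
      return 0;
    }
  }

  // A write that keeps the default (no preference).
  static class DefaultWrite implements RequiresDistributionAndOrderingLike {}

  // A write that advises ~128 MB shuffle partitions (advisory only; Spark
  // does not guarantee the actual partition sizes will match).
  static class TunedWrite implements RequiresDistributionAndOrderingLike {
    @Override
    public long advisoryPartitionSizeInBytes() {
      return 128L * 1024 * 1024;
    }
  }

  public static void main(String[] args) {
    System.out.println(new DefaultWrite().advisoryPartitionSizeInBytes()); // 0
    System.out.println(new TunedWrite().advisoryPartitionSizeInBytes());   // 134217728
  }
}
```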
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkWriteRequirements.java:
##########
@@ -47,4 +51,8 @@ public SortOrder[] ordering() {
public boolean hasOrdering() {
return ordering.length != 0;
}
+
+ public long advisoryPartitionSize() {
+    return distribution instanceof UnspecifiedDistribution ? 0 : advisoryPartitionSize;
Review Comment:
Will add.
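The guard in the hunk above can be sketched in isolation: since Spark fails a query that sets an advisory partition size with an unspecified distribution, the getter reports 0 (no preference) in that case. The classes below are local stand-ins mimicking the Spark/Iceberg types, not the real ones:

```java
// Self-contained sketch of the advisory-size guard in SparkWriteRequirements.
// Distribution/UnspecifiedDistribution here are hypothetical stand-ins for
// the Spark connector types.
public class AdvisoryGuardSketch {

  interface Distribution {}
  static class UnspecifiedDistribution implements Distribution {}
  static class ClusteredDistribution implements Distribution {}

  static class Requirements {
    private final Distribution distribution;
    private final long advisoryPartitionSize;

    Requirements(Distribution distribution, long advisoryPartitionSize) {
      this.distribution = distribution;
      this.advisoryPartitionSize = advisoryPartitionSize;
    }

    long advisoryPartitionSize() {
      // Suppress the advisory size when the distribution is unspecified;
      // otherwise Spark would reject the query.
      return distribution instanceof UnspecifiedDistribution ? 0 : advisoryPartitionSize;
    }
  }

  public static void main(String[] args) {
    long sixtyFourMb = 64L << 20;
    System.out.println(new Requirements(new UnspecifiedDistribution(), sixtyFourMb).advisoryPartitionSize()); // 0
    System.out.println(new Requirements(new ClusteredDistribution(), sixtyFourMb).advisoryPartitionSize());   // 67108864
  }
}
```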
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]