[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins

GitBox Wed, 07 Dec 2022 12:08:05 -0800


aokolnychyi commented on code in PR #6371:
URL: https://github.com/apache/iceberg/pull/6371#discussion_r1042631790



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java:
##########
@@ -42,4 +42,9 @@ private SparkSQLProperties() {}
   // Controls whether to check the order of fields during writes
   public static final String CHECK_ORDERING = 
"spark.sql.iceberg.check-ordering";
   public static final boolean CHECK_ORDERING_DEFAULT = true;
+
+  // Controls whether to preserve the existing grouping of data while planning 
splits
+  public static final String PRESERVE_DATA_GROUPING =
+      "spark.sql.iceberg.split.preserve-data-grouping";
+  public static final boolean PRESERVE_DATA_GROUPING_DEFAULT = false;

Review Comment:
   My worry is the performance regressions that this may cause. There may be 
substantially more splits if this config is on. In order to benefit from SPJ, 
joins must have equality conditions on partition columns. Spark will propagate 
join conditions in the future. Our long-term plan may be to check if v2 
bucketing enabled and whether we have join conditions on partition columns to 
return true by default. Even that will be suboptimal cause we don't know if SPJ 
would work.



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java:
##########
@@ -42,4 +42,9 @@ private SparkSQLProperties() {}
   // Controls whether to check the order of fields during writes
   public static final String CHECK_ORDERING = 
"spark.sql.iceberg.check-ordering";
   public static final boolean CHECK_ORDERING_DEFAULT = true;
+
+  // Controls whether to preserve the existing grouping of data while planning 
splits
+  public static final String PRESERVE_DATA_GROUPING =
+      "spark.sql.iceberg.split.preserve-data-grouping";
+  public static final boolean PRESERVE_DATA_GROUPING_DEFAULT = false;

Review Comment:
   My worry is performance regressions that this may cause. There may be 
substantially more splits if this config is on. In order to benefit from SPJ, 
joins must have equality conditions on partition columns. Spark will propagate 
join conditions in the future. Our long-term plan may be to check if v2 
bucketing enabled and whether we have join conditions on partition columns to 
return true by default. Even that will be suboptimal cause we don't know if SPJ 
would work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6371: Spark 3.3: Support storage-partitioned joins

Reply via email to