xloya commented on code in PR #7096:
URL: https://github.com/apache/iceberg/pull/7096#discussion_r1143246891
##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##########
@@ -123,10 +124,10 @@ public class DeleteOrphanFilesSparkAction extends BaseSparkAction<DeleteOrphanFi
   private ExecutorService deleteExecutorService = null;

   DeleteOrphanFilesSparkAction(SparkSession spark, Table table) {
-    super(spark);
-
-    this.hadoopConf = new SerializableConfiguration(spark.sessionState().newHadoopConf());
-    this.listingParallelism = spark.sessionState().conf().parallelPartitionDiscoveryParallelism();
+    super(spark.cloneSession());
+    spark().conf().set(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD().key(), -1);
Review Comment:
@sririshindra Hardcoding this is admittedly not the most elegant solution, but compared to the risk of an OOM, I think the cost of a sort-merge join is acceptable. The root cause is that we scan the metadata table first and Spark bases its size estimation on that metadata table, so the most thorough fix would be to make Spark's estimation more accurate.
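
For context, here is a minimal sketch (not the actual PR change) of the pattern being discussed: clone the session so the override stays scoped to the action, then set `spark.sql.autoBroadcastJoinThreshold` to -1 so Spark falls back to a sort-merge join instead of broadcasting a file listing whose size it underestimated from the metadata table. The class name below is illustrative; only the Spark calls mirror the diff.

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.internal.SQLConf;

// Illustrative sketch only; the class name is hypothetical.
class OrphanScanSessionSketch {

  private final SparkSession spark;

  OrphanScanSessionSketch(SparkSession spark) {
    // Clone so the config override does not leak into the caller's session.
    this.spark = spark.cloneSession();
    // -1 disables auto broadcast joins, so Spark picks a sort-merge join
    // rather than broadcasting a potentially huge file listing and OOMing.
    this.spark.conf().set(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD().key(), -1);
  }

  SparkSession session() {
    return spark;
  }
}
```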
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]