singhpk234 commented on code in PR #11273:
URL: https://github.com/apache/iceberg/pull/11273#discussion_r1794359228


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java:
##########
@@ -158,6 +163,26 @@ public void filter(Predicate[] predicates) {
     }
   }
 
+  protected Map<String, DeleteFileSet> dataToFileScopedDeletes() {

Review Comment:
   [doubt] Why do we need this whole hash map of all the files with deletes to be broadcast from the driver to the executors? It is derived from the scan tasks anyway, and each executor should already have its scan tasks via scanTask(), so could we not build a local hash map within the executor and merge there, since an executor only needs to apply the deletes for the data files its own tasks point to? Am I missing something here?
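   
   For illustration, a minimal sketch of the executor-local alternative I have in mind (a hypothetical helper, not code from this PR; it assumes the tasks an executor receives still carry their delete files via `FileScanTask#deletes()`):
   
   ```java
   import java.util.List;
   import java.util.Map;
   import org.apache.iceberg.DeleteFile;
   import org.apache.iceberg.FileScanTask;
   import org.apache.iceberg.relocated.com.google.common.collect.Maps;
   import org.apache.iceberg.util.DeleteFileSet;
   
   // Hypothetical executor-side helper: derive the data-file -> delete-files
   // mapping locally from the tasks this executor already holds, so nothing
   // needs to be broadcast from the driver.
   static Map<String, DeleteFileSet> localFileScopedDeletes(List<FileScanTask> tasks) {
     Map<String, DeleteFileSet> deletes = Maps.newHashMap();
     for (FileScanTask task : tasks) {
       String dataFilePath = task.file().path().toString();
       for (DeleteFile deleteFile : task.deletes()) {
         // group each delete file under the data file it applies to
         deletes.computeIfAbsent(dataFilePath, ignored -> DeleteFileSet.create()).add(deleteFile);
       }
     }
     return deletes;
   }
   ```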
   
   
   [a bit orthogonal] Can we put an estimate on the size of this hash map? If it grows very large it can fail the query; IIRC Spark's broadcast size limit is 8 GB.
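   
   A rough way to sanity-check this on the driver before broadcasting, sketched with Spark's `SizeEstimator` (approximate, since it walks the object graph; the 1 GiB threshold and the `LOG` usage here are just placeholders for the sketch):
   
   ```java
   import org.apache.spark.util.SizeEstimator;
   
   Map<String, DeleteFileSet> fileScopedDeletes = dataToFileScopedDeletes();
   // SizeEstimator returns an approximate deep size of the map's object graph.
   long estimatedBytes = SizeEstimator.estimate(fileScopedDeletes);
   if (estimatedBytes > 1L << 30) { // arbitrary 1 GiB warning threshold for the sketch
     LOG.warn("Broadcasting ~{} bytes of file-scoped deletes", estimatedBytes);
   }
   ```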


