aokolnychyi commented on PR #8755: URL: https://github.com/apache/iceberg/pull/8755#issuecomment-1754147825

> Should we apply some intelligence to how we distribute tasks so that we can get the most out of the executor cache? For example, could we prefer sending together data files that have many overlapping delete files, or that belong to the same partition (e.g., position deletes)?

@singhpk234, I have a follow-up change to do exactly that. Unfortunately, it is a bit controversial: there is no way to express task affinity in Spark, only locality. The best option for us is to implement what `KafkaRDD` does. The problem is that this approach only works well when dynamic allocation is disabled. Even without that, this feature should be useful.
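To illustrate the `KafkaRDD` approach mentioned above: Spark's `KafkaRDD` steers each partition toward a consistent executor by reporting it as a preferred location (`getPreferredLocations`), chosen by hashing the partition over the sorted executor list. A minimal standalone sketch of that mapping under stated assumptions — the class and method names here are hypothetical, and this is not Iceberg or Spark code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PreferredLocations {
    // Hypothetical sketch of KafkaRDD-style affinity: deterministically map
    // each partition to one executor from the sorted executor list, so the
    // same partition keeps landing where its cached state is already warm.
    static String preferredExecutor(int partitionId, List<String> executors) {
        List<String> sorted = new ArrayList<>(executors);
        Collections.sort(sorted); // stable order regardless of input ordering
        int idx = Math.floorMod(Integer.hashCode(partitionId), sorted.size());
        return sorted.get(idx);
    }

    public static void main(String[] args) {
        List<String> execs = Arrays.asList("exec-2:7337", "exec-1:7337", "exec-3:7337");
        // Repeated calls for the same partition return the same executor,
        // which is what keeps an executor-side cache effective.
        System.out.println(preferredExecutor(5, execs));
    }
}
```

Because the mapping depends on the current executor list, adding or removing executors (which is exactly what dynamic allocation does) reshuffles the assignments and defeats the cache affinity — the limitation noted in the comment above.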
> Should we apply some intelligence on how we are distributing the tasks so that we could utilize the max from the executor cache ? For ex : lets say we could prefer sending those set of data files which have a lot of overlapping delete files or may be belong to some partition (for ex : position deletes) ? @singhpk234, I have a follow-up change to do that. Unfortunately, it is a bit controversial. There is no way to express task affinity in Spark, only locality. The best option for us is to implement what `KafkaRDD` does. The problem is that it only works well if dynamic allocation is disabled. Even without that, this feature should be useful. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org