aokolnychyi commented on code in PR #7714:
URL: https://github.com/apache/iceberg/pull/7714#discussion_r1283804917
##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java:
##########
@@ -232,6 +232,16 @@ protected synchronized List<ScanTaskGroup<T>> taskGroups() {
     return taskGroups;
   }
 
+  private long targetSplitSize() {
+    if (readConf().adaptiveSplitSizeEnabled()) {
+      long scanSize = tasks().stream().mapToLong(ScanTask::sizeBytes).sum();
+      int parallelism = sparkContext().defaultParallelism();
Review Comment:
@rdblue, I meant the core count would adjust once the cluster scales up. The
initial job may not benefit from this, though I wasn't sure whether that is a
big deal given that acquiring new executors is generally slow.
I feel we should use the current core count if dynamic allocation is
disabled (which we can check). When dynamic allocation is enabled, we can rely
on the number of shuffle partitions or check the dynamic allocation config
(e.g. we know the core count per executor and the max number of
executors). It seems the dynamic allocation config would give us the more
precise estimate.
Thoughts, @rdblue @ConeyLiu?
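
To make the proposal concrete, here is a minimal sketch of the parallelism
estimate described above. It is not from the PR: the class and method names
are hypothetical, the config keys are standard Spark settings, and the
fallback of 1 core per executor mirrors the common YARN default.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Hypothetical helper, not part of the PR: estimates the parallelism
    // used to size splits, following the approach sketched in the comment.
    class ParallelismEstimator {
      // Spark's default when spark.sql.shuffle.partitions is unset
      private static final int DEFAULT_SHUFFLE_PARTITIONS = 200;

      static int estimateParallelism(JavaSparkContext sparkContext) {
        SparkConf conf = sparkContext.getConf();

        // dynamic allocation disabled: the current core count is stable,
        // so use it directly
        if (!conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
          return sparkContext.defaultParallelism();
        }

        // dynamic allocation enabled: derive the eventual core count from
        // the allocation config (max executors x cores per executor),
        // if a bound on executors is configured
        int maxExecutors = conf.getInt("spark.dynamicAllocation.maxExecutors", -1);
        int coresPerExecutor = conf.getInt("spark.executor.cores", 1);
        if (maxExecutors > 0) {
          return maxExecutors * coresPerExecutor;
        }

        // otherwise fall back to the shuffle partition count
        return conf.getInt("spark.sql.shuffle.partitions", DEFAULT_SHUFFLE_PARTITIONS);
      }
    }

With an estimate like this, targetSplitSize() above could divide scanSize by
the result and clamp it to the configured min/max split sizes.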