aokolnychyi commented on code in PR #7714:
URL: https://github.com/apache/iceberg/pull/7714#discussion_r1283804917
##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java:
##########
@@ -232,6 +232,16 @@ protected synchronized List<ScanTaskGroup<T>> taskGroups() {
     return taskGroups;
   }
 
+  private long targetSplitSize() {
+    if (readConf().adaptiveSplitSizeEnabled()) {
+      long scanSize = tasks().stream().mapToLong(ScanTask::sizeBytes).sum();
+      int parallelism = sparkContext().defaultParallelism();
Review Comment:
@rdblue, I meant the core count would adjust once the cluster scales up. The
initial job may not benefit from this, though I wasn't sure whether that is a
big deal given that acquiring new executors is generally slow.
I feel we should use the current core count if dynamic allocation is
disabled (which we can check). When dynamic allocation is enabled, we can rely
on the number of shuffle partitions or check the dynamic allocation config
(e.g. we know the core count per executor and the max number of
executors). It seems the dynamic allocation config would give us the more
precise estimate.
Thoughts, @rdblue @ConeyLiu?
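
To make the proposal concrete, here is a minimal sketch of the parallelism
estimate described above. It is not from the PR: the class and method names
are hypothetical, the config keys are standard Spark settings, and the
fallback of 1 core per executor mirrors the common YARN default.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Hypothetical helper, not part of the PR: estimates the parallelism
    // used to size splits, following the approach sketched in the comment.
    class ParallelismEstimator {
      // Spark's default when spark.sql.shuffle.partitions is unset
      private static final int DEFAULT_SHUFFLE_PARTITIONS = 200;

      static int estimateParallelism(JavaSparkContext sparkContext) {
        SparkConf conf = sparkContext.getConf();

        // dynamic allocation disabled: the current core count is stable,
        // so use it directly
        if (!conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
          return sparkContext.defaultParallelism();
        }

        // dynamic allocation enabled: derive the eventual core count from
        // the allocation config (max executors x cores per executor),
        // if a bound on executors is configured
        int maxExecutors = conf.getInt("spark.dynamicAllocation.maxExecutors", -1);
        int coresPerExecutor = conf.getInt("spark.executor.cores", 1);
        if (maxExecutors > 0) {
          return maxExecutors * coresPerExecutor;
        }

        // otherwise fall back to the shuffle partition count
        return conf.getInt("spark.sql.shuffle.partitions", DEFAULT_SHUFFLE_PARTITIONS);
      }
    }

With an estimate like this, targetSplitSize() above could divide scanSize by
the result and clamp it to the configured min/max split sizes.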