aokolnychyi commented on code in PR #7731:
URL: https://github.com/apache/iceberg/pull/7731#discussion_r1217415171


##########
core/src/main/java/org/apache/iceberg/util/TableScanUtil.java:
##########
@@ -79,6 +88,44 @@ public static CloseableIterable<FileScanTask> splitFiles(
     return CloseableIterable.combine(splitTasks, tasks);
   }
 
+  /**
+   * Produces {@link CombinedScanTask combined tasks} from an iterable of {@link FileScanTask file
+   * tasks}, using an adaptive target split size that targets a minimum number of tasks
+   * (parallelism).
+   *
+   * @param files incoming iterable of file tasks
+   * @param parallelism target minimum number of tasks
+   * @param splitSize target split size
+   * @param lookback bin packing lookback
+   * @param openFileCost minimum file cost
+   * @return an iterable of combined tasks
+   */
+  public static CloseableIterable<CombinedScanTask> planTasksAdaptive(
+      CloseableIterable<FileScanTask> files,
+      int parallelism,
+      long splitSize,
+      int lookback,
+      long openFileCost) {
+
+    validatePlanningArguments(splitSize, lookback, openFileCost);
+
+    Function<FileScanTask, Long> weightFunc =
+        file ->
+            Math.max(

Review Comment:
   The purpose of the open file cost was to avoid packing lots of tiny KB-sized files into a single 128 MB split, which would create straggler tasks with many files. When we added this, we only had data files, and we thought the cost of opening files should not matter as long as file sizes are reasonable.
   
   I still tend to believe the cost of opening files should not matter for reasonably sized files, so that the way splits are packed today does not change.
   
   Comparing `fileSize` and `openCost` for each file seems worth exploring.


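For context on the "adaptive target split size" named in the quoted Javadoc, here is a hypothetical sketch of one way such a size could be derived from `splitSize` and `parallelism`; the shrink-to-parallelism rule, the floor value, and the class name are assumptions for illustration, not necessarily what this PR implements.

```java
// Hypothetical sketch: shrink the configured split size just enough that
// the total task weight yields at least `parallelism` splits, with a floor
// so splits never become degenerately small. Not the PR's actual logic.
final class AdaptiveSplitSizeSketch {
  private static final long MIN_SPLIT_SIZE = 4L * 1024 * 1024; // assumed floor

  static long adaptiveSplitSize(long totalWeight, int parallelism, long splitSize) {
    long sizeForParallelism = (totalWeight + parallelism - 1) / parallelism; // ceiling division
    return Math.max(Math.min(splitSize, sizeForParallelism), MIN_SPLIT_SIZE);
  }

  public static void main(String[] args) {
    // 10 GiB of work with parallelism 200 shrinks the split from 128 MB to ~51 MiB.
    long adjusted = adaptiveSplitSize(10L * 1024 * 1024 * 1024, 200, 128L * 1024 * 1024);
    System.out.println(adjusted); // 53687092
  }
}
```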

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

