[GitHub] [iceberg] danielcweeks commented on a diff in pull request #7731: Core: Implement adaptive split planning in core.

via GitHub Tue, 30 May 2023 10:17:51 -0700


danielcweeks commented on code in PR #7731:
URL: https://github.com/apache/iceberg/pull/7731#discussion_r1210586239



##########
core/src/main/java/org/apache/iceberg/util/TableScanUtil.java:
##########
@@ -79,6 +88,44 @@ public static CloseableIterable<FileScanTask> splitFiles(
     return CloseableIterable.combine(splitTasks, tasks);
   }
 
+  /**
+   * Produces {@link CombinedScanTask combined tasks} from an iterable of {@link FileScanTask file
+   * tasks}, using an adaptive target split size that targets a minimum number of tasks
+   * (parallelism).
+   *
+   * @param files incoming iterable of file tasks
+   * @param parallelism target minimum number of tasks
+   * @param splitSize target split size
+   * @param lookback bin packing lookback
+   * @param openFileCost minimum file cost
+   * @return an iterable of combined tasks
+   */
+  public static CloseableIterable<CombinedScanTask> planTasksAdaptive(
+      CloseableIterable<FileScanTask> files,
+      int parallelism,
+      long splitSize,
+      int lookback,
+      long openFileCost) {
+
+    validatePlanningArguments(splitSize, lookback, openFileCost);
+
+    Function<FileScanTask, Long> weightFunc =
+        file ->
+            Math.max(

Review Comment:
   I was looking at the bin packing, and it seems like we always need weights
that are comparable across tasks (e.g. you can't use file size for one task and
open cost for another, because they're too dissimilar).  If open cost is in
bytes, it feels like the weight should just be
`totalFileSize + (fileCount * openCost)`.
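
   As a rough illustration of the point above, here is a minimal sketch of the
additive weight next to a max-based one. The class name, method names, and the
4 MiB open cost are all hypothetical, and the max-based variant is only an
assumed point of comparison, not a copy of the truncated expression in the
diff.

```java
public class WeightSketch {
  // Hypothetical additive weight: bytes to read plus a per-file open cost (also in bytes).
  static long additiveWeight(long totalFileSize, int fileCount, long openCost) {
    return totalFileSize + (fileCount * openCost);
  }

  // Hypothetical max-based weight, included only for contrast (an assumption).
  static long maxWeight(long totalFileSize, int fileCount, long openCost) {
    return Math.max(totalFileSize, fileCount * openCost);
  }

  public static void main(String[] args) {
    long openCost = 4L * 1024 * 1024; // assumed 4 MiB open cost
    long[] sizes = {1024L, 3L * 1024 * 1024, 64L * 1024 * 1024};
    for (long size : sizes) {
      // With the max-based weight, every file smaller than openCost gets the same weight.
      System.out.printf(
          "size=%d additive=%d max=%d%n",
          size, additiveWeight(size, 1, openCost), maxWeight(size, 1, openCost));
    }
  }
}
```

   In this sketch the max-based weight gives a 1 KiB file and a 3 MiB file the
same weight, while the additive form keeps every weight in bytes and still
distinguishes them.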



