Re: [PR] [Spark]Add max files rewrite option for RewriteAction [iceberg]

via GitHub Mon, 12 May 2025 04:37:43 -0700


pvary commented on code in PR #12824:
URL: https://github.com/apache/iceberg/pull/12824#discussion_r2084484027



##########
core/src/main/java/org/apache/iceberg/actions/BinPackRewriteFilePlanner.java:
##########
@@ -199,30 +214,48 @@ protected long defaultTargetFileSize() {
   public FileRewritePlan<FileGroupInfo, FileScanTask, DataFile, 
RewriteFileGroup> plan() {
     StructLikeMap<List<List<FileScanTask>>> plan = planFileGroups();
     RewriteExecutionContext ctx = new RewriteExecutionContext();
-    Stream<RewriteFileGroup> groups =
-        plan.entrySet().stream()
-            .filter(e -> !e.getValue().isEmpty())
-            .flatMap(
-                e -> {
-                  StructLike partition = e.getKey();
-                  List<List<FileScanTask>> scanGroups = e.getValue();
-                  return scanGroups.stream()
-                      .map(
-                          tasks -> {
-                            long inputSize = inputSize(tasks);
-                            return newRewriteGroup(
-                                ctx,
-                                partition,
-                                tasks,
-                                inputSplitSize(inputSize),
-                                expectedOutputFiles(inputSize));
-                          });
-                })
-            .sorted(RewriteFileGroup.comparator(rewriteJobOrder));
+    List<RewriteFileGroup> selectedFileGroups = new ArrayList<>();
+    AtomicInteger fileCountRunner = new AtomicInteger();
+    plan.entrySet().stream()

Review Comment:
   I think it is not a good place to truncate.
   If there are many files in the table and the first `max-files-to-rewrite` 
files are correctly sized, then we truncate at the given number, and later 
remove the other files from the plan. So basically we end skipping the 
compaction.
   
   We minimally have to do the filtering after the applying the `filterFiles`, 
but that could still result in filtering out whole groups by 
`filterFileGroups`. That would cause the similar issue but less frequently. We 
can end up in a situation where we compact fewer files than configured.
   
   Checked the code, and it seems that we need to filter the result of the 
`SizeBasedFileRewritePlanner.planFileGroups` to get the correct number of 
files. At this point the tasks are all materialized, so we are back in the 
original situation. We need to accept that we don't solve the memory issue with 
this solution.
   
   WDYT? Did I miss something?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] [Spark]Add max files rewrite option for RewriteAction [iceberg]

Reply via email to