pvary commented on code in PR #12824: URL: https://github.com/apache/iceberg/pull/12824#discussion_r2084484027
########## core/src/main/java/org/apache/iceberg/actions/BinPackRewriteFilePlanner.java: ########## @@ -199,30 +214,48 @@ protected long defaultTargetFileSize() { public FileRewritePlan<FileGroupInfo, FileScanTask, DataFile, RewriteFileGroup> plan() { StructLikeMap<List<List<FileScanTask>>> plan = planFileGroups(); RewriteExecutionContext ctx = new RewriteExecutionContext(); - Stream<RewriteFileGroup> groups = - plan.entrySet().stream() - .filter(e -> !e.getValue().isEmpty()) - .flatMap( - e -> { - StructLike partition = e.getKey(); - List<List<FileScanTask>> scanGroups = e.getValue(); - return scanGroups.stream() - .map( - tasks -> { - long inputSize = inputSize(tasks); - return newRewriteGroup( - ctx, - partition, - tasks, - inputSplitSize(inputSize), - expectedOutputFiles(inputSize)); - }); - }) - .sorted(RewriteFileGroup.comparator(rewriteJobOrder)); + List<RewriteFileGroup> selectedFileGroups = new ArrayList<>(); + AtomicInteger fileCountRunner = new AtomicInteger(); + plan.entrySet().stream() Review Comment: I think it is not a good place to truncate. If there are many files in the table and the first `max-files-to-rewrite` files are correctly sized, then we truncate at the given number, and later remove the other files from the plan. So basically we end skipping the compaction. We minimally have to do the filtering after the applying the `filterFiles`, but that could still result in filtering out whole groups by `filterFileGroups`. That would cause the similar issue but less frequently. We can end up in a situation where we compact fewer files than configured. Checked the code, and it seems that we need to filter the result of the `SizeBasedFileRewritePlanner.planFileGroups` to get the correct number of files. At this point the tasks are all materialized, so we are back in the original situation. We need to accept that we don't solve the memory issue with this solution. WDYT? Did I miss something? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org