coderfender commented on code in PR #12824: URL: https://github.com/apache/iceberg/pull/12824#discussion_r2085140996
########## core/src/main/java/org/apache/iceberg/actions/BinPackRewriteFilePlanner.java: ########## @@ -199,30 +214,48 @@ protected long defaultTargetFileSize() { public FileRewritePlan<FileGroupInfo, FileScanTask, DataFile, RewriteFileGroup> plan() { StructLikeMap<List<List<FileScanTask>>> plan = planFileGroups(); RewriteExecutionContext ctx = new RewriteExecutionContext(); - Stream<RewriteFileGroup> groups = - plan.entrySet().stream() - .filter(e -> !e.getValue().isEmpty()) - .flatMap( - e -> { - StructLike partition = e.getKey(); - List<List<FileScanTask>> scanGroups = e.getValue(); - return scanGroups.stream() - .map( - tasks -> { - long inputSize = inputSize(tasks); - return newRewriteGroup( - ctx, - partition, - tasks, - inputSplitSize(inputSize), - expectedOutputFiles(inputSize)); - }); - }) - .sorted(RewriteFileGroup.comparator(rewriteJobOrder)); + List<RewriteFileGroup> selectedFileGroups = new ArrayList<>(); + AtomicInteger fileCountRunner = new AtomicInteger(); + plan.entrySet().stream() Review Comment: Thank you for the feedback @pvary . The main goal here is to optimize the `rewrite` action itself. So in essence we want users to compact data at a volume which is more practical (which is more or less a combination of number of files and the size of the load). The option might not solve all the tricks at once but it is a potential step in the right direction . You are right about the fact that we need to pick our poison so to speak in this usecase . Either we truncate the file list right after we get the total number of files returned after the scan and clearly mention to the users that the `max files to rewrite` are indeed the upper bound and IMHO users should be okay with that . Now, we could revert to filter the number of files per group right after the other filters / ops are applied but this might create the plan too big as per your valid concern. I would still be inclined to go ahead and provide this option to the users and build on it in future releases -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org