coderfender commented on code in PR #12824:
URL: https://github.com/apache/iceberg/pull/12824#discussion_r2085140996


##########
core/src/main/java/org/apache/iceberg/actions/BinPackRewriteFilePlanner.java:
##########
@@ -199,30 +214,48 @@ protected long defaultTargetFileSize() {
   public FileRewritePlan<FileGroupInfo, FileScanTask, DataFile, RewriteFileGroup> plan() {
     StructLikeMap<List<List<FileScanTask>>> plan = planFileGroups();
     RewriteExecutionContext ctx = new RewriteExecutionContext();
-    Stream<RewriteFileGroup> groups =
-        plan.entrySet().stream()
-            .filter(e -> !e.getValue().isEmpty())
-            .flatMap(
-                e -> {
-                  StructLike partition = e.getKey();
-                  List<List<FileScanTask>> scanGroups = e.getValue();
-                  return scanGroups.stream()
-                      .map(
-                          tasks -> {
-                            long inputSize = inputSize(tasks);
-                            return newRewriteGroup(
-                                ctx,
-                                partition,
-                                tasks,
-                                inputSplitSize(inputSize),
-                                expectedOutputFiles(inputSize));
-                          });
-                })
-            .sorted(RewriteFileGroup.comparator(rewriteJobOrder));
+    List<RewriteFileGroup> selectedFileGroups = new ArrayList<>();
+    AtomicInteger fileCountRunner = new AtomicInteger();
+    plan.entrySet().stream()

Review Comment:
   Thank you for the feedback @pvary. The main goal here is to optimize the 
`rewrite` action itself: in essence, we want users to compact data at a more 
practical volume, which is roughly a combination of the number of files and 
the size of the load. The option might not solve every case at once, but it 
is a step in the right direction.
   
    You are right that we have to pick our poison, so to speak, in this use 
case. Either we truncate the file list as soon as the scan returns the total 
number of files, and clearly document that `max files to rewrite` is an upper 
bound (IMHO users should be okay with that), or we filter the number of files 
per group after the other filters/ops are applied, which could make the plan 
too big, as per your valid concern.
   I would still be inclined to go ahead, provide this option to users, and 
build on it in future releases.
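   To make the first option concrete, here is a minimal sketch of the 
upper-bound truncation I have in mind. This is illustration only, not the 
PR's code: `FileGroup`, `fileCount`, and `maxFilesToRewrite` are hypothetical 
stand-ins for `RewriteFileGroup` and the new option. The idea is to keep 
whole groups while a running file count stays within the cap, so the 
configured value acts as an upper bound rather than an exact target:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   public class MaxFilesTruncationSketch {
   
     // Hypothetical stand-in for a planned group of file scan tasks; in the
     // PR this role is played by RewriteFileGroup.
     record FileGroup(String partition, int fileCount) {}
   
     // Keep whole groups while the running file count stays within the cap,
     // so maxFilesToRewrite is an upper bound that is never exceeded.
     static List<FileGroup> truncate(List<FileGroup> groups, int maxFilesToRewrite) {
       List<FileGroup> selected = new ArrayList<>();
       int runningFileCount = 0;
       for (FileGroup group : groups) {
         if (runningFileCount + group.fileCount() > maxFilesToRewrite) {
           break; // adding this group would overshoot the cap
         }
         runningFileCount += group.fileCount();
         selected.add(group);
       }
       return selected;
     }
   
     public static void main(String[] args) {
       List<FileGroup> planned =
           List.of(new FileGroup("p=0", 40), new FileGroup("p=1", 35), new FileGroup("p=2", 50));
       // With a cap of 100 files, only the first two groups (75 files) are kept.
       System.out.println(truncate(planned, 100));
     }
   }
   ```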



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

