Re: [PR] [Spark]Add max files rewrite option for RewriteAction [iceberg]

via GitHub Fri, 09 May 2025 09:34:38 -0700


coderfender commented on code in PR #12824:
URL: https://github.com/apache/iceberg/pull/12824#discussion_r2082047278



##########
core/src/main/java/org/apache/iceberg/actions/BinPackRewriteFilePlanner.java:
##########
@@ -199,30 +214,48 @@ protected long defaultTargetFileSize() {
   public FileRewritePlan<FileGroupInfo, FileScanTask, DataFile, RewriteFileGroup> plan() {
     StructLikeMap<List<List<FileScanTask>>> plan = planFileGroups();
     RewriteExecutionContext ctx = new RewriteExecutionContext();
-    Stream<RewriteFileGroup> groups =
-        plan.entrySet().stream()
-            .filter(e -> !e.getValue().isEmpty())
-            .flatMap(
-                e -> {
-                  StructLike partition = e.getKey();
-                  List<List<FileScanTask>> scanGroups = e.getValue();
-                  return scanGroups.stream()
-                      .map(
-                          tasks -> {
-                            long inputSize = inputSize(tasks);
-                            return newRewriteGroup(
-                                ctx,
-                                partition,
-                                tasks,
-                                inputSplitSize(inputSize),
-                                expectedOutputFiles(inputSize));
-                          });
-                })
-            .sorted(RewriteFileGroup.comparator(rewriteJobOrder));
+    List<RewriteFileGroup> selectedFileGroups = new ArrayList<>();
+    AtomicInteger fileCountRunner = new AtomicInteger();
+    plan.entrySet().stream()

Review Comment:
   Sure. The thought behind this PR was to give users an option to limit the 
number of files being rewritten (either due to load or due to NameNode (NN) 
pressure). I agree that pushing the filter down to `planFileGroups` is a good 
idea, since truncating early would avoid creating a heavy plan object. However, 
I am not completely clear on whether moving the filter into `planFileGroups` 
actually solves the problem of building a large plan. We would still have to 
scan the table to get the files present, right?
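
To make the question concrete, here is a minimal sketch of the truncation step on its own. The class `MaxFilesPlanSketch`, the method `selectGroups`, and the `maxFiles` parameter are hypothetical names for illustration and are not part of Iceberg or of this PR. Stopping at the first group that would exceed the cap is also an assumption. Note that the groups passed in would still have to come from a table scan, which is the part I am unsure about.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from Iceberg. */
final class MaxFilesPlanSketch {

  private MaxFilesPlanSketch() {}

  /**
   * Keeps whole groups, in the given order, while the running file count
   * stays within maxFiles. Stops at the first group that would exceed the cap
   * (an assumed policy), so later groups are never materialized into the plan.
   */
  static <T> List<List<T>> selectGroups(Iterable<List<T>> orderedGroups, int maxFiles) {
    List<List<T>> selected = new ArrayList<>();
    int fileCount = 0;
    for (List<T> group : orderedGroups) {
      if (fileCount + group.size() > maxFiles) {
        break;
      }
      selected.add(group);
      fileCount += group.size();
    }
    return selected;
  }

  public static void main(String[] args) {
    // Strings stand in for file scan tasks.
    List<List<String>> groups =
        List.of(List.of("a", "b"), List.of("c", "d", "e"), List.of("f"));
    System.out.println(selectGroups(groups, 5)); // prints [[a, b], [c, d, e]]
  }
}
```

A plain loop with a local counter is used here instead of a stream with an `AtomicInteger`, since the selection is sequential and order dependent.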



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

