Re: [PR] [Spark]Add max files rewrite option for RewriteAction [iceberg]

via GitHub Thu, 17 Apr 2025 09:58:13 -0700


yogevyuval commented on PR #12824:
URL: https://github.com/apache/iceberg/pull/12824#issuecomment-2813556075


   > This adds a new option for rewriting data files (Spark actions) that lets 
users limit the number of files rewritten, potentially reducing file 
operations. The option is named `max-files-to-rewrite`; it takes a positive 
integer and truncates the list of file tasks once that value is reached. If 
the table has fewer files than the parameter value, all files are processed 
for rewrite. A property check rejects any value less than 1 so that invalid 
input fails early.
   > 
   > Implementation :
   > 
   > 1. The `toGroupStream` method in `RewriteDataFilesSparkAction` has been 
refactored to truncate the list of file scan tasks (and thereby the files to 
be processed).
   > 2. An atomic integer called `fileCountRunner` (atomic to stay consistent 
in parallel streams) keeps a running count while `groupsByPartition` is 
processed in parallel.
   > 3. If the size of the entire fileScanTask list in a partition is > 
maxFilesToRewrite + fileCountRunner, the fileScanTask list is _truncated_ so 
that files are added only until the maxFilesToRewrite value is reached.
   > 4. `selectedFileGroups` holds the final file groups.
   > 
   > Testing :
   > 
   > 1. `TestRewriteDataFilesAction::testRewriteMaxFilesOption` covers the 
upper-bound case, where `max-files-to-rewrite` > total number of files in the 
table.
   > 2. `TestRewriteDataFilesAction::testRewriteMaxFilesOptionEquality` covers 
the equality case, where `max-files-to-rewrite` < total number of files in the 
table and the number of resulting data files after the rewrite equals 
`max-files-to-rewrite`.
   
   Could you elaborate on the use case this is trying to solve? Is it meant to 
limit the resources a single job needs, in order to reduce failures?
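
The truncation described in implementation steps 1 to 3 of the quoted description can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the PR's code: the class `MaxFilesTruncation`, the `truncate` helper, and the use of plain string lists in place of Iceberg file scan tasks are all hypothetical. Only the names `fileCountRunner` and `maxFilesToRewrite` come from the description, and the budget check here is one possible formulation rather than the exact condition in the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the truncation step; not part of Iceberg.
final class MaxFilesTruncation {

  // Returns at most as many tasks as the shared budget still allows.
  static <T> List<T> truncate(
      List<T> tasks, AtomicInteger fileCountRunner, int maxFilesToRewrite) {
    // Early failure for invalid values, mirroring the property check.
    if (maxFilesToRewrite < 1) {
      throw new IllegalArgumentException("max files to rewrite must be >= 1");
    }
    int size = tasks.size();
    // Atomically reserve up to `size` slots and get the count seen before
    // the reservation, so concurrent partitions never exceed the cap.
    int before =
        fileCountRunner.getAndUpdate(c -> Math.min(maxFilesToRewrite, c + size));
    int allowed = Math.min(size, maxFilesToRewrite - before);
    return new ArrayList<>(tasks.subList(0, allowed));
  }

  public static void main(String[] args) {
    AtomicInteger fileCountRunner = new AtomicInteger();
    List<List<String>> partitions = Arrays.asList(
        Arrays.asList("a1", "a2", "a3"),
        Arrays.asList("b1", "b2", "b3", "b4"),
        Arrays.asList("c1", "c2"));
    // Processed sequentially with a cap of 5, the partitions keep
    // 3, 2 and 0 files respectively.
    for (List<String> partition : partitions) {
      System.out.println(truncate(partition, fileCountRunner, 5).size());
    }
  }
}
```

Reserving slots with a single atomic update keeps the total at or below the cap even when partitions are processed in parallel, although which partitions get truncated then depends on processing order.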


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


