coderfender opened a new pull request, #12824:
URL: https://github.com/apache/iceberg/pull/12824

   This adds a new option for rewriting data files (Spark Actions) that lets 
users limit the number of files rewritten, potentially reducing file 
operations. The option is named `max-files-to-rewrite`. It takes a positive 
integer and truncates the file tasks once that value is reached. If the 
table has fewer files than the parameter value, all files are processed for 
rewrite. A property check rejects any value less than 1 so that invalid 
input fails early.
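
   The property check described above might look roughly like the sketch 
below. It is illustrative only and is not the code from this PR: the class 
`MaxFilesOption`, the method `parse`, and the "no limit" default of 
`Integer.MAX_VALUE` when the option is absent are all assumptions. Only the 
option name and the "no value less than 1" rule come from the description.

```java
import java.util.Map;

// Hypothetical helper, not part of Iceberg: parses and validates the option.
final class MaxFilesOption {
  // Option name as given in the PR description.
  static final String MAX_FILES_TO_REWRITE = "max-files-to-rewrite";

  private MaxFilesOption() {}

  static int parse(Map<String, String> options) {
    String raw = options.get(MAX_FILES_TO_REWRITE);
    if (raw == null) {
      // Assumption: an absent option means "no limit".
      return Integer.MAX_VALUE;
    }
    int value = Integer.parseInt(raw.trim());
    if (value < 1) {
      // Fail early, before any file scan tasks are planned.
      throw new IllegalArgumentException(
          MAX_FILES_TO_REWRITE + " must be at least 1, but was " + value);
    }
    return value;
  }
}
```

   Under these assumptions, `MaxFilesOption.parse(Map.of("max-files-to-rewrite", "0"))` 
throws an `IllegalArgumentException` instead of silently rewriting nothing.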
   
   Implementation:
   
   The toGroupStream method in RewriteDataFilesSparkAction has been refactored 
to truncate the list of file scan tasks (and thereby the files to be processed).
   An AtomicInteger called fileCountRunner (used to keep the count consistent 
in parallel streams) tracks the running file count while groupsByPartition is 
processed in parallel.
   When adding a partition's fileScanTask list would push fileCountRunner past 
maxFilesToRewrite, the list is truncated so that only the files needed to 
reach maxFilesToRewrite are added.
   selectedFileGroups holds the final file groups.
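
   The truncation steps above can be sketched as follows. This is a simplified 
stand-in under stated assumptions, not the PR's actual toGroupStream code: 
the class `MaxFilesTruncation` and its methods `take` and `truncate` are 
hypothetical, plain lists stand in for file scan tasks, and the 
compare-and-set loop is one possible way to keep the shared counter 
consistent across parallel partitions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not part of Iceberg.
final class MaxFilesTruncation {
  private MaxFilesTruncation() {}

  // Reserves up to tasks.size() slots from the shared counter and returns
  // the prefix of the list that fits within the remaining budget.
  static <T> List<T> take(List<T> tasks, AtomicInteger fileCountRunner, int maxFilesToRewrite) {
    while (true) {
      int used = fileCountRunner.get();
      int allowed = Math.min(tasks.size(), maxFilesToRewrite - used);
      if (allowed <= 0) {
        return List.of(); // budget already exhausted
      }
      // Retry if another thread updated the counter in the meantime.
      if (fileCountRunner.compareAndSet(used, used + allowed)) {
        return new ArrayList<>(tasks.subList(0, allowed));
      }
    }
  }

  // Processes partitions in parallel and keeps at most maxFilesToRewrite
  // tasks in total across all partitions.
  static <K, T> Map<K, List<T>> truncate(Map<K, List<T>> groupsByPartition, int maxFilesToRewrite) {
    AtomicInteger fileCountRunner = new AtomicInteger();
    Map<K, List<T>> selectedFileGroups = new ConcurrentHashMap<>();
    groupsByPartition.entrySet().parallelStream().forEach(entry -> {
      List<T> kept = take(entry.getValue(), fileCountRunner, maxFilesToRewrite);
      if (!kept.isEmpty()) {
        selectedFileGroups.put(entry.getKey(), kept);
      }
    });
    return selectedFileGroups;
  }
}
```

   In this sketch the total number of selected tasks is the smaller of 
maxFilesToRewrite and the total task count, but which partitions contribute 
files depends on the order in which the parallel stream visits them.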
   
   Testing:
   
   TestRewriteDataFilesAction::testRewriteMaxFilesOption covers the 
upper-bound case, where max-files-to-rewrite > total number of files in the 
table.
   TestRewriteDataFilesAction::testRewriteMaxFilesOptionEquality covers the 
equality case, where max-files-to-rewrite < total number of files in the 
table and the number of resulting data files after the rewrite equals 
max-files-to-rewrite.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

