coderfender opened a new pull request, #12824: URL: https://github.com/apache/iceberg/pull/12824
This PR adds a new option for rewriting data files (Spark Actions) that lets users limit the number of files rewritten, potentially reducing file operations. The option is named `max-files-to-rewrite`. It takes a positive integer and truncates the file scan tasks once that many files have been selected. If the table has fewer files than the given value, all files are processed for rewrite. A property check rejects any value less than 1 so that invalid input fails early.

Implementation:
- The toGroupStream method in RewriteDataFilesSparkAction has been refactored to truncate the list of file scan tasks (and thereby the files to be processed).
- An atomic integer called fileCountRunner (atomic to ensure consistency in parallel streams) tracks the running file count as groupsByPartition is processed in parallel.
- If a partition's fileScanTask list would push the running count in fileCountRunner past maxFilesToRewrite, the list is truncated so that files are added only until maxFilesToRewrite is reached.
- selectedFileGroups holds the final file groups.

Testing:
- TestRewriteDataFilesAction::testRewriteMaxFilesOption covers the upper-bound case, where max-files-to-rewrite > total number of files in the table.
- TestRewriteDataFilesAction::testRewriteMaxFilesOptionEquality covers the equality case, where max-files-to-rewrite < total number of files in the table and the number of resulting data files after the rewrite equals max-files-to-rewrite.
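The truncation described under Implementation can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: the class `MaxFilesToRewriteSketch` and the method `limitFiles` are hypothetical names, plain strings stand in for Iceberg's file scan tasks, and a sequential loop replaces the parallel stream so the result is deterministic. Only the names `fileCountRunner`, `maxFilesToRewrite`, and the minimum value of 1 come from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class MaxFilesToRewriteSketch {

  // Hypothetical helper: keeps at most maxFilesToRewrite files across all partitions.
  // Strings stand in for file scan tasks; this is an assumption, not Iceberg's API.
  public static Map<String, List<String>> limitFiles(
      Map<String, List<String>> filesByPartition, int maxFilesToRewrite) {
    // Fail early on invalid input, mirroring the property check in the description.
    if (maxFilesToRewrite < 1) {
      throw new IllegalArgumentException(
          "max-files-to-rewrite must be at least 1, got " + maxFilesToRewrite);
    }

    // Running count of files selected so far. An AtomicInteger keeps the
    // reservation below consistent even if partitions were handled concurrently.
    AtomicInteger fileCountRunner = new AtomicInteger(0);
    Map<String, List<String>> selectedFileGroups = new LinkedHashMap<>();

    for (Map.Entry<String, List<String>> entry : filesByPartition.entrySet()) {
      List<String> files = entry.getValue();

      // Atomically reserve up to files.size() slots, capped at the limit,
      // and read the count as it was before this partition was added.
      int before = fileCountRunner.getAndUpdate(
          count -> Math.min(maxFilesToRewrite, count + files.size()));

      // Number of files from this partition that still fit under the limit.
      int take = Math.min(files.size(), maxFilesToRewrite - before);
      if (take > 0) {
        selectedFileGroups.put(entry.getKey(), new ArrayList<>(files.subList(0, take)));
      }
    }
    return selectedFileGroups;
  }

  public static void main(String[] args) {
    Map<String, List<String>> filesByPartition = new LinkedHashMap<>();
    filesByPartition.put("a", Arrays.asList("f1", "f2", "f3"));
    filesByPartition.put("b", Arrays.asList("f4", "f5"));

    // With a limit of 4, partition "b" is truncated to a single file.
    System.out.println(limitFiles(filesByPartition, 4)); // {a=[f1, f2, f3], b=[f4]}
  }
}
```

With a limit larger than the total number of files, every file is kept, which corresponds to the upper-bound case in the tests. Against a real table, the option would presumably be passed as a string through the rewrite action's `option("max-files-to-rewrite", "4")` call; that usage is inferred from the option name and is not quoted from the PR.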