yogevyuval commented on PR #12824: URL: https://github.com/apache/iceberg/pull/12824#issuecomment-2813556075
> This is a new option for rewriting data files (Spark Actions) that lets the user limit the number of files rewritten, potentially reducing file OPS. The option is named `max-files-to-rewrite`. It takes a positive integer and truncates the file tasks once that value is reached. If the table has fewer files than the parameter value, all files are processed for rewrite. A property check rejects any value less than 1 so that invalid input fails early.
>
> Implementation:
>
> 1. The `toGroupStream` method in `RewriteDataFilesSparkAction` has been refactored to truncate the list of file scan tasks (and thereby the files to be processed).
> 2. An atomic integer called `fileCountRunner` (atomic to ensure consistency in parallel streams) is used as a running counter while `groupsByPartition` is processed in parallel.
> 3. If the size of the entire fileScanTask list in a partition is > maxFilesToRewrite + fileCountRunner, the fileScanTask list is _truncated_ so that files are added only until the maxFilesToRewrite value is reached.
> 4. `selectedFileGroups` holds the final file groups.
>
> Testing:
>
> 1. `TestRewriteDataFilesAction::testRewriteMaxFilesOption` covers the upper-bound case, where the value of `max-files-to-rewrite` > total number of files in the table.
> 2. `TestRewriteDataFilesAction::testRewriteMaxFilesOptionEquality` covers the equality case, where the value of `max-files-to-rewrite` < total number of files in the table and the number of data files resulting from the rewrite equals `max-files-to-rewrite`.

Could you elaborate on the use case this is trying to solve? Is it to limit the resources a single job needs, in order to reduce failures?
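A minimal sketch of the truncation described in implementation steps 2 and 3 of the quoted description, for readers who want to see the idea concretely. It is not the PR's code: the class `MaxFilesTruncationSketch` and the helpers `claim` and `truncate` are hypothetical, partitions are keyed by plain strings, and a generic `T` stands in for Iceberg's file scan task type. Only the names `fileCountRunner`, `maxFilesToRewrite`, `groupsByPartition`, and `selectedFileGroups` come from the description. The loop here is sequential, and the shared counter is reserved with a compare-and-set so the same helper would stay within the limit if partitions were processed in parallel.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not taken from RewriteDataFilesSparkAction.
final class MaxFilesTruncationSketch {

  // Atomically reserves up to `wanted` slots from the shared budget and
  // returns how many were granted (0 once the limit has been reached).
  static int claim(AtomicInteger fileCountRunner, int wanted, int maxFilesToRewrite) {
    while (true) {
      int used = fileCountRunner.get();
      int granted = Math.min(wanted, maxFilesToRewrite - used);
      if (granted <= 0) {
        return 0;
      }
      // Retry if another thread changed the counter in the meantime.
      if (fileCountRunner.compareAndSet(used, used + granted)) {
        return granted;
      }
    }
  }

  // Keeps at most maxFilesToRewrite tasks in total across all partitions,
  // truncating the partition in which the limit is reached.
  static <T> Map<String, List<T>> truncate(
      Map<String, List<T>> groupsByPartition, int maxFilesToRewrite) {
    if (maxFilesToRewrite < 1) {
      // Mirrors the early-failure property check mentioned in the description.
      throw new IllegalArgumentException("max-files-to-rewrite must be >= 1");
    }
    AtomicInteger fileCountRunner = new AtomicInteger(0);
    Map<String, List<T>> selectedFileGroups = new LinkedHashMap<>();
    for (Map.Entry<String, List<T>> entry : groupsByPartition.entrySet()) {
      List<T> tasks = entry.getValue();
      int granted = claim(fileCountRunner, tasks.size(), maxFilesToRewrite);
      if (granted > 0) {
        selectedFileGroups.put(entry.getKey(), new ArrayList<>(tasks.subList(0, granted)));
      }
    }
    return selectedFileGroups;
  }
}
```

With two partitions of three tasks each and a limit of 4, this sketch keeps all three tasks from the first partition visited and one from the second. Which files end up excluded therefore depends on iteration order, which is part of why the intended use case matters.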