aokolnychyi opened a new pull request, #9069: URL: https://github.com/apache/iceberg/pull/9069
This PR adjusts the split computation logic in file rewriters. The previous logic performed poorly in some cases.

Suppose we have 4 files of 145 MB each, so 580 MB to compact. If the target file size is 512 MB, the rewriter decides to produce 2 output files because the input is about 13% larger than the target file size and only 10% overhead is allowed. The previous logic then used 580 MB / 2 = 290 MB as the split size, so the compaction produced 2 output files of roughly 290 MB each, already poorly sized relative to the 512 MB target. Such files are picked up again in the next round even if there is no new data, creating a never-ending loop of useless compaction.

This PR makes sure the split size is never less than the target output file size.
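
The arithmetic in the example can be sketched as follows. This is only an illustration, not the code in the PR: the function names (`num_output_files`, `old_split_size`, `new_split_size`) are hypothetical, and the rounding rule is a simplification that assumes the rewriter rounds down only when the input fits within the 10% overhead.

```python
import math

MB = 1024 * 1024


def num_output_files(input_size, target_size, overhead=0.10):
    """Hypothetical, simplified estimate of the number of output files.

    Rounds down only when the input fits into the whole number of
    target-sized files plus the allowed overhead; otherwise rounds up.
    """
    if input_size <= target_size:
        return 1
    whole = input_size // target_size
    if input_size <= whole * target_size * (1 + overhead):
        return whole
    return math.ceil(input_size / target_size)


def old_split_size(input_size, target_size):
    """Previous behavior as described: divide the input evenly."""
    return input_size // num_output_files(input_size, target_size)


def new_split_size(input_size, target_size):
    """Adjusted behavior as described: never go below the target size."""
    return max(old_split_size(input_size, target_size), target_size)


input_size = 4 * 145 * MB   # 580 MB of input
target_size = 512 * MB

print(num_output_files(input_size, target_size))       # 2
print(old_split_size(input_size, target_size) // MB)   # 290
print(new_split_size(input_size, target_size) // MB)   # 512
```

With the floor in place, the 580 MB input is no longer cut into two 290 MB splits, which is what avoids the repeated compaction described above.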