aokolnychyi opened a new pull request, #9069:
URL: https://github.com/apache/iceberg/pull/9069

   This PR adjusts the split computation logic in file rewriters. The previous 
logic performed poorly in some cases.
   
   Suppose we have 4 files, 145 MB each. This means we have 580 MB to compact. 
If the target file size is 512 MB, the rewriter will decide to produce 2 output 
files because the input is 13% larger than the target file size (we allow only 
10% overhead). In that case, the previous logic uses 580 MB / 2 = 290 MB as 
the split size, so the compaction produces 2 output files of about 290 MB each, 
well below the 512 MB target. Such poorly sized files will be picked up again 
in the next round even if there is no new data, creating a never-ending loop of 
useless compaction. 
   
   This PR makes sure the split size is never less than the target output file 
size.
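
   To make the arithmetic concrete, here is a minimal Java sketch of the two 
behaviors described above. The class and method names (`SplitSizeSketch`, 
`previousSplitSize`, `adjustedSplitSize`) are hypothetical and only illustrate 
the description; they are not taken from the Iceberg code, and the actual 
change may apply additional bounds or overhead that this sketch leaves out.

```java
// Hypothetical illustration only; names are assumptions, not Iceberg APIs.
final class SplitSizeSketch {
  static final long MB = 1024L * 1024L;

  // Previous behavior as described above: spread the input evenly
  // across the expected number of output files.
  static long previousSplitSize(long inputSize, int numOutputFiles) {
    return inputSize / numOutputFiles;
  }

  // Adjusted behavior as described above: the split size is never
  // allowed to drop below the target file size.
  static long adjustedSplitSize(long inputSize, int numOutputFiles, long targetFileSize) {
    return Math.max(targetFileSize, inputSize / numOutputFiles);
  }

  public static void main(String[] args) {
    long inputSize = 4 * 145 * MB;   // 4 files of 145 MB each = 580 MB
    long targetFileSize = 512 * MB;
    int numOutputFiles = 2;          // chosen by the rewriter in the example above

    // prints 290
    System.out.println(previousSplitSize(inputSize, numOutputFiles) / MB);
    // prints 512
    System.out.println(adjustedSplitSize(inputSize, numOutputFiles, targetFileSize) / MB);
  }
}
```

   With the numbers from the example, the even split yields 290 MB while the 
lower bound keeps the split size at 512 MB. How the rewriter picks the number 
of output files is outside the scope of this sketch.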

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

