Re: [PR] [Spark]Add max files rewrite option for RewriteAction [iceberg]

via GitHub Wed, 30 Apr 2025 01:57:26 -0700


pvary commented on PR #12824:
URL: https://github.com/apache/iceberg/pull/12824#issuecomment-2841286239


   > Sure @pvary, the goal here is to give users an option to limit the number 
of files to rewrite, since rewriting can destabilize jobs when the file 
count is very high (which is common in large data platforms, including the 
one I manage atm). Also, I plan to raise a similar PR on the Flink side to 
support this option (discussed this with @mxm internally) in the future. 
   
   My question is aimed at understanding whether we really need 2 different 
configs: one for the number of files in the compaction plan (this PR), and 
another for the maximum data size of the compaction plan (a feature in the Flink 
PR). Maybe we can merge the two and provide a single configuration 
option that covers both use-cases.
   In Flink's case the issue was that the planner generated plans that were too 
big, and the compaction took forever to finish when it was applied for the first 
time to a previously unmaintained table. Limiting the size of the data to compact 
in one run seemed like an intuitive option for the users. Subsequent compaction 
plans ignored the already compacted files, so the table was eventually 
fully compacted.
   Would the size-based option work for your use-case, or does it need a config 
around the file count specifically?
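   
   To make the comparison concrete, here is a minimal sketch of a planner that 
accepts either budget. It is not the Iceberg or Flink API: `DataFile`, 
`plan_rewrite`, `max_files`, and `max_bytes` are hypothetical names used only 
for illustration. It also shows why repeated runs converge when each plan skips 
the files that were already compacted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a data file entry; not an Iceberg class.
    path: str
    size_bytes: int


def plan_rewrite(
    candidates: List[DataFile],
    max_files: Optional[int] = None,
    max_bytes: Optional[int] = None,
) -> List[DataFile]:
    """Select files in order until the file-count or byte budget is reached."""
    selected: List[DataFile] = []
    total = 0
    for f in candidates:
        if max_files is not None and len(selected) >= max_files:
            break
        # Always admit at least one file so an oversized file cannot stall progress.
        if max_bytes is not None and selected and total + f.size_bytes > max_bytes:
            break
        selected.append(f)
        total += f.size_bytes
    return selected


def runs_until_compacted(files: List[DataFile], **limits) -> int:
    """Count how many bounded plans are needed to cover every file once."""
    remaining = list(files)
    runs = 0
    while remaining:
        batch = plan_rewrite(remaining, **limits)
        # Later plans ignore files that an earlier run already rewrote.
        remaining = [f for f in remaining if f not in batch]
        runs += 1
    return runs
```

   With ten 100-byte files, `max_files=3` takes four runs and `max_bytes=250` 
takes five. Either limit bounds the work per run, which is why a single option 
might be enough.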
   
   > However, if you think we are close to merging the Spark rewrite 
refactor changes to the main branch, I can wait on these changes, rebase the code 
(although I am more inclined towards getting this merged first) and raise a PR on 
the new main branch
   
   I left a single question for @RussellSpitzer, and once that is answered 
we can merge the other PR immediately. (I looked at my PR first and hadn't 
realized that the PR is such an immediate roadblock. If I had known, I might have 
just gone with the original version.) Sorry for the delay.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

