pvary commented on PR #12824: URL: https://github.com/apache/iceberg/pull/12824#issuecomment-2841286239
> Sure @pvary, the goal here is to provide users an option to limit number of files to rewrite which could potentially destabilize jobs in case the file count is super high (which is most common in large data platforms including the one I manage atm). Also, I plan to raise a similar PR on the flink side to support this option (discussed this with @mxm internally) in the future.

My question is aimed at understanding whether we really need two different configs: one for the number of files in the compaction plan (this PR), and another for the maximum data size of the compaction plan (a feature in the Flink PR). Maybe we can merge the two and provide a single configuration option that covers both use cases.

In Flink's case, the issue was that the planner generated plans that were too big, and the compaction took forever to finish when it was first applied to a previously unmaintained table. Limiting the size of the data to compact in one run seemed like an intuitive option for users. Subsequent compaction plans ignored the already compacted files, so the table was eventually fully compacted.

Would the size-based option work for your use case, or does it need a config around the file sizes specifically?

> However if you think we are closer in terms of merging spark rewrite refactor changes to main branch soon, I can wait on this changes, rebase code (although I am more inclined towards getting this merged first) and raise PR on new main branch

I left a single question for @RussellSpitzer, and once it is answered we can merge the other PR immediately. (I looked at my PR first and hadn't realized that this PR is such an immediate roadblock. If I had known, I might have just gone with the original version.) Sorry for the delay.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org