suxiaogang223 opened a new pull request, #60063: URL: https://github.com/apache/doris/pull/60063
## What problem does this PR solve?

During Iceberg `rewrite_data_files` operations, an unexpectedly large number of small files is generated when the BE count is large.

Problem: total_files = task_count × active_BE_count × partition_count

## What is changed and how it works?

An adaptive strategy controls the number of output files based on data volume:

1. Calculate the expected file count: ceil(totalDataSize / targetFileSizeBytes)
2. For small data (expectedFileCount <= 1): use GATHER distribution
3. For larger data: limit parallelism to min(expectedFileCount, BE_count, default)

## Benefits

- Small data: reduced from 100+ files to ~1 file (90%+ reduction)
- Adaptive strategy, no manual tuning needed

## Check List

- [x] Code changes
- [x] Test strategy described
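For illustration, the three steps under "What is changed and how it works?" can be sketched as below. This is a minimal Python sketch, not the code from this PR: the names `WritePlan`, `plan_rewrite_output`, and `worst_case_file_count` are hypothetical, and `default_parallelism` is only one reading of the "default" term in step 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WritePlan:
    """Hypothetical result type: how the rewrite output should be distributed."""
    use_gather: bool   # True -> funnel all data to a single writer (GATHER)
    parallelism: int   # number of parallel writers


def worst_case_file_count(task_count: int, active_be_count: int, partition_count: int) -> int:
    """File count from the problem statement: every factor multiplies the total."""
    return task_count * active_be_count * partition_count


def plan_rewrite_output(total_data_size: int, target_file_size_bytes: int,
                        be_count: int, default_parallelism: int) -> WritePlan:
    """Pick a distribution from data volume (assumed parameter names)."""
    if target_file_size_bytes <= 0:
        raise ValueError("target_file_size_bytes must be positive")

    # Step 1: expected file count = ceil(totalDataSize / targetFileSizeBytes),
    # computed with integer arithmetic to avoid floating-point rounding.
    expected_file_count = -(-total_data_size // target_file_size_bytes)

    # Step 2: small data fits in one target-sized file, so use GATHER.
    if expected_file_count <= 1:
        return WritePlan(use_gather=True, parallelism=1)

    # Step 3: larger data, cap parallelism by the smallest of the three limits.
    parallelism = min(expected_file_count, be_count, default_parallelism)
    return WritePlan(use_gather=False, parallelism=parallelism)
```

With an assumed 512 MiB target file size, 100 MiB of input on a 100-BE cluster resolves to GATHER with a single writer, while input slightly over 1.5 GiB resolves to four writers regardless of how many BEs are active.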
