mukund-thakur opened a new issue, #16514: URL: https://github.com/apache/iceberg/issues/16514
### Feature Request / Improvement

**Problem:** Assume we have a large existing Iceberg table that is currently partitioned by month. After a few years, we want to evolve the partitioning to include day as well as month, and then rewrite all old data files using the current partition spec, which is (month, day). With the current algorithm, all data files written under the old partition spec are placed in a single large group. If the data to be rewritten is large, a user will want to enable partial-progress, but then the files are split randomly across multiple Spark jobs. As a result, a single output partition is written by multiple Spark jobs, which leads to small files in that partition. These small files often require yet another round of compaction.

**Why it creates small files:** Suppose there are 15 TB of old-spec data files. They are broken into 150 Spark shuffle jobs, each processing 100 GB of data. Because the files in each group are chosen at random, every job can write files to every new partition, potentially leading to as many as 150 files in each output partition.

**Current Solution:** We have to run a separate compaction job to reduce the number of output files in each output partition.

**Proposed Solution:** We can optimize the algorithm to create smaller groups of files, one per old partition, even for old-spec files, provided the current spec satisfies the older spec. By "satisfies", we mean that the new partition spec has the same ordering as the old partition spec. For example, a new partition by day on a timestamp field satisfies an old partition by month on the same timestamp field, but the reverse is not true.

### Query engine

None

### Willingness to contribute

- [ ] I can contribute this improvement/feature independently
- [x] I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time
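To make the problem and the proposed grouping in this issue concrete, here is a minimal Python sketch. It is a toy model, not Iceberg code: `DataFile`, `satisfies`, `plan_groups_round_robin`, `plan_groups_by_old_partition`, and the `GRANULARITY` ranking are all hypothetical names introduced for illustration. It also assumes that each rewrite job writes at least one file into every new partition it touches, and it uses a round-robin split as a deterministic stand-in for the random split described above.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumption: a simple ranking of time transforms on the same source column.
# A finer transform "satisfies" a coarser one because every new partition
# (for example a day) falls inside exactly one old partition (a month).
GRANULARITY = {"year": 0, "month": 1, "day": 2, "hour": 3}


def satisfies(new_transform: str, old_transform: str) -> bool:
    """Return True if the new transform is at least as fine as the old one."""
    return GRANULARITY[new_transform] >= GRANULARITY[old_transform]


@dataclass(frozen=True)
class DataFile:
    path: str
    month: str   # old partition value, e.g. "2021-03"
    days: tuple  # new partition values that the file's rows map to


def plan_groups_round_robin(files, num_groups):
    """Stand-in for the current behaviour: files are spread across groups
    without regard to their old partition."""
    groups = [[] for _ in range(num_groups)]
    for i, f in enumerate(files):
        groups[i % num_groups].append(f)
    return groups


def plan_groups_by_old_partition(files):
    """Sketch of the proposal: one file group per old partition value."""
    groups = defaultdict(list)
    for f in files:
        groups[f.month].append(f)
    return list(groups.values())


def max_files_per_new_partition(groups):
    """Count how many groups (jobs) write into each new partition and
    return the worst case."""
    writers = defaultdict(int)
    for group in groups:
        for day in {d for f in group for d in f.days}:
            writers[day] += 1
    return max(writers.values())


# Three old monthly partitions, four files each; every file holds rows
# for 28 days of its month.
months = ["2021-01", "2021-02", "2021-03"]
files = [
    DataFile(f"{m}/file-{i}.parquet", m,
             tuple(f"{m}-{d:02d}" for d in range(1, 29)))
    for m in months
    for i in range(4)
]

mixed = plan_groups_round_robin(files, 4)
aligned = plan_groups_by_old_partition(files)

print(satisfies("day", "month"))             # True
print(satisfies("month", "day"))             # False
print(max_files_per_new_partition(mixed))    # 4: every job touches every day
print(max_files_per_new_partition(aligned))  # 1: each day is written by one job
```

In this toy setup the mixed plan has every job writing into every day partition, which is the same effect as the 150-files-per-partition worst case described above, while the partition-aligned plan has each day written by a single job. A real planner would presumably still need to split an old partition that exceeds the maximum group size, so more than one file per output partition could remain.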
