amogh-jahagirdar commented on code in PR #13555: URL: https://github.com/apache/iceberg/pull/13555#discussion_r2212581830
##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java:
##########
@@ -300,7 +303,7 @@ public void testBinPackAfterPartitionChange() {
                 Integer.toString(averageFileSize(table) + 1000))
             .option(
                 RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
-                Integer.toString(averageFileSize(table) + 1001))
+                Integer.toString(averageFileSize(table) + 11000))

Review Comment:
   I made a similar change in this test and 2 others, increasing either the target file size or the max write file size. It's not the real solution. After the changes to preserve lineage, we write slightly more data on materialization because of the extra lineage fields. They compress well on disk, but their presence still slightly throws off the number of output files produced by the rewrite.

   Specifically, the extra columns mean the writer hits the max write file size sooner, at which point it rolls over to a new file. Most output files still come out at the appropriate size, but we also produce additional small files. Slightly increasing the target file size avoids those small files, which were only produced because the writer rolled over a little more aggressively than needed.

   I think for V3+ tables we should re-evaluate the default 1.8x ratio, since the lineage fields will be required there, and probably bump it up so we don't regress on specific compaction workloads.

   cc @aokolnychyi @stevenzwu @RussellSpitzer

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
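
The rollover effect described in the review comment can be illustrated with a small Java sketch. This is a toy model, not Iceberg's writer or planner: the class `RolloverSketch`, the method `outputFileSizes`, and all byte counts are hypothetical and chosen only to make the arithmetic visible. It assumes a writer that starts a new file each time a fixed roll size is reached.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (assumption): a writer that rolls to a new file every rollSizeBytes.
class RolloverSketch {

  // Returns the sizes of the files emitted for totalBytes of output.
  static List<Long> outputFileSizes(long totalBytes, long rollSizeBytes) {
    List<Long> sizes = new ArrayList<>();
    long remaining = totalBytes;
    while (remaining > 0) {
      long size = Math.min(remaining, rollSizeBytes);
      sizes.add(size);
      remaining -= size;
    }
    return sizes;
  }

  public static void main(String[] args) {
    long dataBytes = 4_000;   // hypothetical size of the rewritten data
    long lineageBytes = 40;   // hypothetical overhead of the extra lineage fields

    // Without lineage fields: four evenly sized files.
    System.out.println(outputFileSizes(dataBytes, 1_000));
    // With lineage fields and the same roll size: a fifth, tiny file appears.
    System.out.println(outputFileSizes(dataBytes + lineageBytes, 1_000));
    // With a slightly larger roll size: back to four files.
    System.out.println(outputFileSizes(dataBytes + lineageBytes, 1_010));
  }
}
```

In this model a 1% increase in written bytes is enough to add a fifth file of only 40 bytes, and raising the roll size by the same 1% removes it. That is the kind of adjustment the test change above makes, under the simplifying assumptions stated here.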
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org