Re: [PR] Spark 4.0: Preserve row lineage information on compaction [iceberg]

via GitHub Thu, 17 Jul 2025 00:54:17 -0700


amogh-jahagirdar commented on code in PR #13555:
URL: https://github.com/apache/iceberg/pull/13555#discussion_r2212581830



##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java:
##########
@@ -300,7 +303,7 @@ public void testBinPackAfterPartitionChange() {
                 Integer.toString(averageFileSize(table) + 1000))
             .option(
                 RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
-                Integer.toString(averageFileSize(table) + 1001))
+                Integer.toString(averageFileSize(table) + 11000))

Review Comment:
   This test and two others got a similar change: I increased either the 
target file size or the max write file size.
   
   This isn't the real solution. With the changes to preserve lineage, the 
rewrite materializes the extra lineage fields, so we write slightly more data. 
Those fields compress well on disk, but they still slightly throw off the 
number of output files. Specifically, the extra columns mean we hit the max 
write file size sooner, at which point the writer rolls over. Most output 
files still come out at the appropriate size, but we also produce a few 
additional small files.
   
   Slightly increasing the target write file size eliminates those small 
files, which come from the writer rolling over a bit more aggressively than 
needed.
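
To illustrate the rollover effect described above, here is a minimal sketch. It is a simplified model, not Iceberg's actual writer: the class `RollingWriterSketch`, the method `fileSizes`, and all byte counts (including the assumed 5,000 bytes of lineage overhead) are hypothetical and chosen only to show how a small overflow past the max write size produces an extra tiny file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a rolling writer; not Iceberg's implementation.
class RollingWriterSketch {

  // Splits totalBytes into files, rolling over whenever a file reaches maxFileBytes.
  static List<Long> fileSizes(long totalBytes, long maxFileBytes) {
    List<Long> sizes = new ArrayList<>();
    long remaining = totalBytes;
    while (remaining > 0) {
      long size = Math.min(remaining, maxFileBytes);
      sizes.add(size);
      remaining -= size;
    }
    return sizes;
  }

  public static void main(String[] args) {
    long writeMax = 1_000_000L;   // assumed max write file size
    long dataBytes = 1_000_000L;  // group planned to fit exactly one file
    long lineageBytes = 5_000L;   // assumed overhead of materialized lineage fields

    // Without lineage fields: one file.
    System.out.println(fileSizes(dataBytes, writeMax));
    // With lineage fields: the writer rolls over and leaves a small tail file.
    System.out.println(fileSizes(dataBytes + lineageBytes, writeMax));
    // With a slightly larger limit: back to one file.
    System.out.println(fileSizes(dataBytes + lineageBytes, writeMax + 10_000L));
  }
}
```

In this model the second call yields a 1,000,000-byte file plus a 5,000-byte remainder, which is the kind of small file the test adjustments avoid by raising the size limit.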
   
   I think for V3+ tables we should re-evaluate the default 1.8x ratio, since 
the lineage fields will be required there, and probably bump it up so we don't 
regress on specific compaction workloads. cc @aokolnychyi @stevenzwu 
@RussellSpitzer 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
