Re: [PR] Spark 4.0: Preserve row lineage information on compaction [iceberg]

via GitHub Mon, 21 Jul 2025 08:11:55 -0700


amogh-jahagirdar commented on code in PR #13555:
URL: https://github.com/apache/iceberg/pull/13555#discussion_r2219496449



##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java:
##########
@@ -163,6 +165,14 @@ private Spark3Util.CatalogAndIdentifier catalogAndIdentifier(CaseInsensitiveStri
       selector = TAG_PREFIX + tag;
     }
 
+    String groupId =
+        options.getOrDefault(
+            SparkReadOptions.SCAN_TASK_SET_ID,
+            options.get(SparkWriteOptions.REWRITTEN_FILE_SCAN_TASK_SET_ID));
+    if (groupId != null) {
+      selector = REWRITE_PREFIX + groupId.replace("-", "");

Review Comment:
   I think either is fine, since we only match against the `rewrite_` prefix in 
the selector metadata. I stripped the hyphens from the UUID so the selector is 
a bit shorter while staying unique, which keeps it useful for debugging.
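
   To illustrate the point, here is a minimal, standalone sketch of how such a 
selector could be built and matched. The class and method names are 
hypothetical, and the value of `REWRITE_PREFIX` is assumed to be `rewrite_` 
based on the comment above; this is not the code in `IcebergSource`.

```java
import java.util.UUID;

// Hypothetical helper for illustration only; not part of Iceberg.
class RewriteSelectorSketch {
  // Assumption: the prefix value is "rewrite_", as the review comment suggests.
  static final String REWRITE_PREFIX = "rewrite_";

  // Builds a selector from a scan task set ID by dropping the UUID hyphens.
  static String selectorFor(String groupId) {
    return REWRITE_PREFIX + groupId.replace("-", "");
  }

  // Matching only looks at the prefix, so the exact suffix format does not matter.
  static boolean isRewriteSelector(String selector) {
    return selector != null && selector.startsWith(REWRITE_PREFIX);
  }

  public static void main(String[] args) {
    String groupId = UUID.randomUUID().toString(); // 36 chars, 4 of them hyphens
    String selector = selectorFor(groupId);        // prefix + 32 hex chars
    System.out.println(groupId + " -> " + selector);
    System.out.println("is rewrite selector: " + isRewriteSelector(selector));
  }
}
```

   Removing the hyphens loses no information, because a canonical UUID string 
always has them at the same four positions, so two distinct UUIDs still map to 
distinct selectors.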



##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java:
##########
@@ -944,16 +983,15 @@ public void testBinPackCombineMixedFiles() {
     shouldHaveFiles(table, 3);
 
     List<Object[]> expectedRecords = currentData();
-
     int targetSize = averageFileSize(table);
-
     long dataSizeBefore = testDataSize(table);
     Result result =
         basicRewrite(table)
             .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Integer.toString(targetSize + 1000))
             .option(
                 SizeBasedFileRewritePlanner.MAX_FILE_SIZE_BYTES,
-                Integer.toString(targetSize + 80000))
+                // Increase max file size for V3 to account for additional row lineage fields
+                Integer.toString(targetSize + (formatVersion >= 3 ? 1850000 : 80000)))

Review Comment:
   OK, I think for now it's fine to bump these limits up in the tests to 
account for the additional lineage fields, which cause files to roll over 
earlier. I also did some more testing with Parquet V2 and delta encoding, and 
that looks pretty promising for fitting more records with lineage into a given 
file (since the carried-over row IDs, at least in this test, tend to increase 
sequentially within a given file).
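
   As a rough illustration of why sequential row IDs suit delta encoding, the 
sketch below computes how many bits per value are needed to bit-pack the 
deltas between consecutive values after subtracting the smallest delta. This 
is a simplified model of the idea behind Parquet's delta encoding, not the 
actual encoder; the class and method names are hypothetical, and it ignores 
block and miniblock boundaries as well as arithmetic overflow.

```java
// Hypothetical sketch for illustration only; not Parquet's encoder.
class DeltaWidthSketch {
  // Returns the bit width needed to store (delta - minDelta) for each pair of
  // consecutive values. Returns 0 when every delta is identical.
  static int packedBitsPerDelta(long[] values) {
    if (values.length < 2) {
      return 0;
    }
    long minDelta = Long.MAX_VALUE;
    long maxDelta = Long.MIN_VALUE;
    for (int i = 1; i < values.length; i++) {
      long delta = values[i] - values[i - 1];
      minDelta = Math.min(minDelta, delta);
      maxDelta = Math.max(maxDelta, delta);
    }
    // Bit width of the largest adjusted delta; numberOfLeadingZeros(0) is 64.
    return 64 - Long.numberOfLeadingZeros(maxDelta - minDelta);
  }

  public static void main(String[] args) {
    long[] sequentialRowIds = {1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007};
    long[] scatteredRowIds = {5, 900, 17, 40000};
    // Every delta is 1, so the adjusted deltas are all zero.
    System.out.println(packedBitsPerDelta(sequentialRowIds)); // prints 0
    // Deltas are 895, -883, 39983; the adjusted range 40866 needs 16 bits.
    System.out.println(packedBitsPerDelta(scatteredRowIds)); // prints 16
  }
}
```

   Under this model, a run of sequential row IDs costs close to nothing per 
value beyond a small header, while plain encoding would spend 64 bits on each 
ID.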



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

