Re: [PR] Spark 4.0: Preserve row lineage information on compaction [iceberg]

via GitHub Mon, 21 Jul 2025 16:03:01 -0700


amogh-jahagirdar commented on code in PR #13555:
URL: https://github.com/apache/iceberg/pull/13555#discussion_r2220529650



##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java:
##########
@@ -163,6 +165,14 @@ private Spark3Util.CatalogAndIdentifier catalogAndIdentifier(CaseInsensitiveStri
       selector = TAG_PREFIX + tag;
     }
 
+    String groupId =
+        options.getOrDefault(
+            SparkReadOptions.SCAN_TASK_SET_ID,
+            options.get(SparkWriteOptions.REWRITTEN_FILE_SCAN_TASK_SET_ID));
+    if (groupId != null) {
+      selector = REWRITE_PREFIX + groupId.replace("-", "");

Review Comment:
    I'm not sure about a 120-character limit (I did some investigating and 
couldn't find anything related), but even if there is one I think we're still 
safe in the rewrite case. There, the identifier is the groupId UUID, followed 
by "#", followed by the selector, which is another UUID. This UUID gets mapped 
to the actual table reference in SparkTableCache during the compaction job. 
Combined, this is 81 characters.
   
   Actually, saying this out loud made me realize that there's really no value 
in adding the same UUID again for the rewrite case, so we can simplify this a 
bit :)
   
   So we should probably just make the selector the word "rewrite". Then the 
identifier is a UUID + "#_rewrite", which is only 45 characters (36 for the 
UUID plus 9 for "#_rewrite").
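   
   To make the length arithmetic concrete, here's a rough sketch of the 
simplified identifier. The class, the `REWRITE_SELECTOR` constant, and the 
`rewriteIdentifier` helper are hypothetical names for illustration only, not 
what's in IcebergSource today; the only thing it relies on is that 
`UUID.toString()` is always 36 characters.

```java
import java.util.UUID;

public class RewriteSelectorLength {
  // Hypothetical constant for illustration; not an existing field in IcebergSource.
  static final String REWRITE_SELECTOR = "_rewrite";

  // Builds the proposed identifier: the group ID UUID, then "#", then a fixed selector.
  // The group ID is not repeated inside the selector.
  static String rewriteIdentifier(UUID groupId) {
    return groupId + "#" + REWRITE_SELECTOR;
  }

  public static void main(String[] args) {
    UUID groupId = UUID.randomUUID();
    String identifier = rewriteIdentifier(groupId);
    // UUID.toString() is 36 characters, "#" is 1, "_rewrite" is 8, so this prints 45.
    System.out.println(identifier.length());
  }
}
```

   Since the selector is a constant, the identifier length no longer depends 
on how the group ID is formatted in the selector.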
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

