Re: [PR] Spark 3.5 row lineage [iceberg]

via GitHub Sun, 20 Apr 2025 18:39:14 -0700


amogh-jahagirdar commented on PR #12736:
URL: https://github.com/apache/iceberg/pull/12736#issuecomment-2817463379

Thanks @RussellSpitzer for taking a look!

>One thing I want to make sure we test is "WHEN NOT MATCHED" with no
matching clauses. Spark has some logic which transforms these from row level
operations into Appends and I am a little worried some of the logic here for
manipulating the output and input table will not work in that case.

https://github.com/apache/spark/blob/597cafc05d9a4ecd5a111d1dc9c92fb37c77ce3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteMergeIntoTable.scala#L47-L75
https://github.com/apache/iceberg/issues/12653
I think we may be fine there because we essentially do not care about
writing row_lineage or last_updated sequence values in this case, so writing
null or not writing are both valid.

Exactly, the implementation as it is currently does not have an explicit
rule case for projecting row IDs or sequence values in the Append case because
that is a case where we know we would only be adding new rows where all the row
lDs or sequence values will be null. Which means they don't need to explicitly
be persisted as per the spec.

This is what I was trying to say in this comment
https://github.com/apache/iceberg/pull/12736/files#diff-cc38222cfec9d6a4142d6c102e3654a23ea9ec9cc1d9226ba19fbf12ae7c0329R120-R124
but let me know if I could make it more clear. Essentially when building the
SparkWrite, it could either be for an an
1. overwrite files (CoW)
2. an append
3. overwrite dynamic partitions/by filter.

So in both 2,3 we know it will only be addition of new rows or an addition +
removal of existing rows. In these cases the row lineage fields will be null
for the entire write. 1 is the only case where we need to make sure the write
schema has row lineage.

Let me know if I'm missing anything! I used to have an explicit handling for
the merge cases which translated into appends but I realized it's not really
needed so I removed it.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spark 3.5 row lineage [iceberg]

Reply via email to