amogh-jahagirdar commented on PR #12736: URL: https://github.com/apache/iceberg/pull/12736#issuecomment-2817463379
Thanks @RussellSpitzer for taking a look! >One thing I want to make sure we test is "WHEN NOT MATCHED" with no matching clauses. Spark has some logic which transforms these from row level operations into Appends and I am a little worried some of the logic here for manipulating the output and input table will not work in that case. https://github.com/apache/spark/blob/597cafc05d9a4ecd5a111d1dc9c92fb37c77ce3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteMergeIntoTable.scala#L47-L75 https://github.com/apache/iceberg/issues/12653 I think we may be fine there because we essentially do not care about writing row_lineage or last_updated sequence values in this case, so writing null or not writing are both valid. Exactly, the implementation as it is currently does not have an explicit rule case for projecting row IDs or sequence values in the Append case because that is a case where we know we would only be adding new rows where all the row lDs or sequence values will be null. Which means they don't need to explicitly be persisted as per the spec. This is what I was trying to say in this comment https://github.com/apache/iceberg/pull/12736/files#diff-cc38222cfec9d6a4142d6c102e3654a23ea9ec9cc1d9226ba19fbf12ae7c0329R120-R124 but let me know if I could make it more clear. Essentially when building the SparkWrite, it could either be for an an 1. overwrite files (CoW) 2. an append 3. overwrite dynamic partitions/by filter. So in both 2,3 we know it will only be addition of new rows or an addition + removal of existing rows. In these cases the row lineage fields will be null for the entire write. 1 is the only case where we need to make sure the write schema has row lineage. Let me know if I'm missing anything! I used to have an explicit handling for the merge cases which translated into appends but I realized it's not really needed so I removed it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org