albertmorenosugr commented on issue #14851: URL: https://github.com/apache/iceberg/issues/14851#issuecomment-3673006209
Hi @willjanning In Spark 3.5, the observed behavior is caused by how MERGE INTO materializes its output at the data-file level, not by changelog computation. When a table is initialized with INSERT INTO, Spark plans the VALUES input as a partitioned local relation and executes **multiple parallel write tasks, each producing an immutable Parquet data file**. As a result, rows are distributed across multiple data files. A subsequent MERGE INTO first performs a matching phase to identify affected data files, and only the data file containing the matched row is rewritten. Rows stored in other data files remain referenced from the previous snapshot and therefore correctly retain their original change_ordinal. When a table is initialized with MERGE INTO, Spark 3.5 produces the merge result as a single logical output dataset that is written into a **single Parquet data file**, even though the input is logically partitioned. Consequently, all inserted rows are co-located in the same data file. During a subsequent MERGE INTO, the matching phase correctly identifies the data file containing the matched row; however, that same file also contains logically unchanged rows. Because Parquet files are immutable, Spark must rewrite the entire data file during the merge rewrite phase. This forces logically NO-OP rows to be physically rewritten and included in the new snapshot, which correctly results in those rows being assigned change_ordinal = 1. Therefore, the issue is a Spark 3.5 MERGE INTO execution artifact: coarse data-file output from the initial merge causes unnecessary rewrites in later merges. Iceberg behaves correctly given the physical rewrite boundaries imposed by Spark. In Spark 4.x, MERGE INTO is expressed as a true row-level operation under DataSource V2, with a clearer separation between row-level change detection and physical file materialization. The merge output is no longer forced into a single data file, allowing rewrites to be isolated to affected rows’ data files and preventing unnecessary rewrites of logically unchanged rows. As a result, change ordinals are preserved correctly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
