1raghavmahajan opened a new pull request, #14293: URL: https://github.com/apache/iceberg/pull/14293
Closes https://github.com/apache/iceberg/issues/14249

## Implementation

The implementation is similar to what I had initially proposed:

1. Repartition by `identifier_columns` and sort within each partition by `identifier_columns + change_ordinal`.
2. Apply `RemoveCarryoverIterator`.
   Note: This is the same as `net_changes` without identifier columns, but with a simpler repartition spec.
3. Use window functions to identify the first and last changes for each logical row.
4. Filter to keep only the first and last changes (as per `change_ordinal`) for each logical row.
   Note: This performs the _netting_ of the changes: we drop every change except those at the first and last change ordinal, which is cheaper than iterating through them all. The existing `net_changes` cannot leverage this because it does not have a consistent set of identifier columns across the entire snapshot range, so it needs to iterate through all changes to build the lineage.
5. Remove INSERT-DELETE (no-op) pairs using an iterator.
6. Calculate pre/post images from pairs of the first DELETE and the last INSERT. This is similar to the existing [ComputeUpdateIterator](https://github.com/apache/iceberg/blob/e92696998e1338fac05aedb315050d39e82b6b66/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/ComputeUpdateIterator.java#L50). Here we need to handle multiple INSERT/DELETE entries (as the intermediate changes aren't present).
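For readers who want to trace the six steps on a small input, here is a minimal, dependency-free Python sketch of the netting logic. It is an illustration under stated assumptions, not the PR's Java/Spark code: rows are plain dicts, the keys `id`, `data`, `change_type` and `change_ordinal` are hypothetical stand-ins for the identifier columns, the remaining columns and the changelog metadata, and the helper names are invented here. It also assumes at most one live row per identifier at any snapshot, and that a DELETE sorts before an INSERT within the same change ordinal.

```python
from itertools import groupby

INSERT, DELETE = "INSERT", "DELETE"


def _sort_key(row):
    # Step 1: order by identifier, then change ordinal. Within one ordinal a
    # DELETE is assumed to sort before an INSERT (an assumption of this sketch).
    return (row["id"], row["change_ordinal"], 0 if row["change_type"] == DELETE else 1)


def _remove_carryovers(rows):
    # Step 2: drop a DELETE immediately followed by an INSERT in the same
    # ordinal with identical data (the row was rewritten but not changed).
    out = []
    for row in rows:
        prev = out[-1] if out else None
        if (prev and prev["change_type"] == DELETE and row["change_type"] == INSERT
                and prev["change_ordinal"] == row["change_ordinal"]
                and prev["data"] == row["data"]):
            out.pop()
        else:
            out.append(row)
    return out


def _remove_noop_pairs(rows):
    # Step 5: an INSERT followed by a DELETE cancels out, because the row
    # existed neither before nor after that pair.
    out = []
    for row in rows:
        if out and out[-1]["change_type"] == INSERT and row["change_type"] == DELETE:
            out.pop()
        else:
            out.append(row)
    return out


def net_changes(changelog):
    """Return the net change per identifier (hypothetical helper, not an Iceberg API)."""
    result = []
    ordered = sorted(changelog, key=_sort_key)
    for _, group in groupby(ordered, key=lambda r: r["id"]):
        rows = _remove_carryovers(list(group))
        if not rows:
            continue
        # Steps 3-4: keep only the changes at the first and last ordinal.
        edge_ordinals = {rows[0]["change_ordinal"], rows[-1]["change_ordinal"]}
        rows = [r for r in rows if r["change_ordinal"] in edge_ordinals]
        rows = _remove_noop_pairs(rows)
        # Step 6: a surviving DELETE + INSERT pair becomes the pre/post image.
        if (len(rows) == 2 and rows[0]["change_type"] == DELETE
                and rows[1]["change_type"] == INSERT):
            result.append({**rows[0], "change_type": "UPDATE_BEFORE"})
            result.append({**rows[1], "change_type": "UPDATE_AFTER"})
        else:
            result.extend(rows)
    return result
```

In this model, a row inserted at ordinal 0 and updated at ordinals 1 and 2 nets out to a single INSERT of its final value: the intermediate update is filtered out in steps 3-4, and the initial INSERT cancels against the last DELETE in step 5.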
## Testing

- Added iterator tests for [ComputeNetUpdateIterator](https://github.com/1raghavmahajan/iceberg/blob/158177f565055884e3703af49a4155b1b5a60707/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/TestComputeNetUpdateIterator.java) and [RemoveNoopPairIterator](https://github.com/1raghavmahajan/iceberg/blob/158177f565055884e3703af49a4155b1b5a60707/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/TestRemoveNoopPairIterator.java)
- Updated integration tests for [CreateChangelogViewProcedure](https://github.com/1raghavmahajan/iceberg/blob/158177f565055884e3703af49a4155b1b5a60707/spark/v4.0/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestCreateChangelogViewProcedure.java)
