1raghavmahajan opened a new pull request, #14293:
URL: https://github.com/apache/iceberg/pull/14293

   Closes https://github.com/apache/iceberg/issues/14249
   
   ## Implementation
   
   The implementation is similar to what I had initially proposed:
   
   1. Repartition by `identifier_columns` and sort within partition by 
`identifier_columns + change_ordinal`
   2. Apply `RemoveCarryoverIterator`.
   
   Note: This is the same as `net_changes` without identifier columns but with 
a simpler repartition spec.
   
   3. Use window functions to identify first and last changes for each logical 
row
   4. Filter to keep only first and last changes (as per `change_ordinal`) for 
each logical row
   
   Note: This performs the _netting_ of the changes. We drop every change 
except those at the first and last change ordinal, which is cheaper than 
iterating through them all. The existing `net_changes` cannot leverage this 
because we do not have a consistent set of identifier columns across the 
entire snapshot range, so we need to iterate through all changes to build the 
lineage.
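
   As a rough illustration of steps 3 and 4, here is a minimal plain-Python 
sketch. It is not the Spark window-function code from this PR: the helper name 
`keep_first_and_last` and the dict-based row layout are made up for the 
example, and it assumes carry-over rows were already removed.

```python
from collections import defaultdict


def keep_first_and_last(changes, id_cols):
    """Keep only the rows at the first and last change_ordinal per identifier.

    Hypothetical helper for illustration; rows are plain dicts.
    """
    # Group rows by their identifier columns (stands in for the repartition).
    groups = defaultdict(list)
    for row in changes:
        key = tuple(row[c] for c in id_cols)
        groups[key].append(row)

    kept = []
    for rows in groups.values():
        # Stable sort: rows sharing an ordinal keep their input order.
        rows = sorted(rows, key=lambda r: r["change_ordinal"])
        first = rows[0]["change_ordinal"]
        last = rows[-1]["change_ordinal"]
        # Everything strictly between the first and last ordinal is dropped.
        kept.extend(r for r in rows if r["change_ordinal"] in (first, last))
    return kept


# Made-up sample: id=1 is inserted, then updated twice; id=2 is inserted once.
changes = [
    {"id": 1, "data": "a", "change_type": "INSERT", "change_ordinal": 0},
    {"id": 1, "data": "a", "change_type": "DELETE", "change_ordinal": 1},
    {"id": 1, "data": "b", "change_type": "INSERT", "change_ordinal": 1},
    {"id": 1, "data": "b", "change_type": "DELETE", "change_ordinal": 2},
    {"id": 1, "data": "c", "change_type": "INSERT", "change_ordinal": 2},
    {"id": 2, "data": "x", "change_type": "INSERT", "change_ordinal": 1},
]
netted = keep_first_and_last(changes, ["id"])
```

   In this sample the two rows at ordinal 1 for `id=1` are dropped, and the 
four remaining rows are what the later steps would see.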
   
   5. Remove INSERT-DELETE (no-op) pairs using an iterator. 
   6. Calculate pre/post images using first DELETE - last INSERT pairs. This is 
similar to the existing 
[ComputeUpdateIterator](https://github.com/apache/iceberg/blob/e92696998e1338fac05aedb315050d39e82b6b66/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/ComputeUpdateIterator.java#L50).
 Here we need to handle multiple INSERT/DELETE entries (as the intermediate 
changes aren't present).
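
   To make steps 5 and 6 concrete, here is a simplified plain-Python sketch 
that collapses both into one function for a single logical row. It is not the 
iterator code from this PR: the name `net_row_changes` and the dict-based rows 
are hypothetical, it assumes a DELETE is ordered before an INSERT within the 
same `change_ordinal`, and it does not try to detect the case where the pre 
and post images are identical.

```python
def net_row_changes(rows):
    """Net result for ONE identifier, given its first/last changes.

    Hypothetical helper for illustration; rows are plain dicts.
    """
    if not rows:
        return []
    # Assumption: within one ordinal, DELETE sorts before INSERT.
    ordered = sorted(
        rows, key=lambda r: (r["change_ordinal"], r["change_type"] != "DELETE")
    )
    first, last = ordered[0], ordered[-1]
    existed_before = first["change_type"] == "DELETE"
    exists_after = last["change_type"] == "INSERT"

    if existed_before and exists_after:
        # First DELETE is the pre-image, last INSERT is the post-image.
        # Ordinals are left untouched here; that is a simplification.
        return [
            {**first, "change_type": "UPDATE_BEFORE"},
            {**last, "change_type": "UPDATE_AFTER"},
        ]
    if existed_before:
        return [first]  # row existed before the range and is gone after it
    if exists_after:
        return [last]  # row was created inside the range
    return []  # INSERT ... DELETE inside the range: a no-op pair


# Made-up samples.
updated = net_row_changes([
    {"id": 1, "data": "a", "change_type": "DELETE", "change_ordinal": 0},
    {"id": 1, "data": "b", "change_type": "INSERT", "change_ordinal": 0},
    {"id": 1, "data": "b", "change_type": "DELETE", "change_ordinal": 2},
    {"id": 1, "data": "c", "change_type": "INSERT", "change_ordinal": 2},
])
noop = net_row_changes([
    {"id": 2, "data": "x", "change_type": "INSERT", "change_ordinal": 0},
    {"id": 2, "data": "x", "change_type": "DELETE", "change_ordinal": 1},
])
```

   Here `updated` holds an `UPDATE_BEFORE` row with data `a` and an 
`UPDATE_AFTER` row with data `c`, while `noop` is empty because the row was 
both created and deleted inside the range.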
   
   ## Testing
   
   - Added iterator tests for 
[ComputeNetUpdateIterator](https://github.com/1raghavmahajan/iceberg/blob/158177f565055884e3703af49a4155b1b5a60707/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/TestComputeNetUpdateIterator.java)
 and 
[RemoveNoopPairIterator](https://github.com/1raghavmahajan/iceberg/blob/158177f565055884e3703af49a4155b1b5a60707/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/TestRemoveNoopPairIterator.java)
   - Updated integration tests for 
[CreateChangelogViewProcedure](https://github.com/1raghavmahajan/iceberg/blob/158177f565055884e3703af49a4155b1b5a60707/spark/v4.0/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestCreateChangelogViewProcedure.java)
 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

