[GitHub] [iceberg] CodingCat commented on a diff in pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

via GitHub Sat, 08 Apr 2023 08:22:05 -0700


CodingCat commented on code in PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#discussion_r1161124315



##########
docs/spark-procedures.md:
##########
@@ -587,3 +587,119 @@ Get all the snapshot ancestors by a particular snapshot
 CALL spark_catalog.system.ancestors_of('db.tbl', 1)
 CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
 ```
+
+## Change Data Capture 
+
+### `create_changelog_view`
+
+Creates a view that contains the changes from a given table. 
+
+#### Remove carry-over rows
+
+The procedure removes the carry-over rows by default. Carry-over rows are the result of row-level operations (`MERGE`, `UPDATE` and `DELETE`)
+when using copy-on-write. For example, given a file which contains row1 `(id=1, name='Alice')` and row2 `(id=2, name='Bob')`,
+a copy-on-write delete of row2 would require erasing this file and preserving row1 in a new file. The changelog table
+reports this as the following pair of rows, despite it not being an actual change to the table.
+
+| id  | name  | _change_type |
+|-----|-------|--------------|
+| 1   | Alice | DELETE       |
+| 1   | Alice | INSERT       |
+
+By default, this view finds the carry-over rows and removes them from the result. Users can disable this
+behavior by setting the `remove_carryovers` option to `false`.
+
+#### Compute pre/post update images
+
+The procedure computes the pre/post update images if configured. Pre/post update images are converted from a
+pair of a delete row and an insert row. Identifier columns are used for determining whether an insert and a delete record
+refer to the same row. If the two records share the same values for the identifier columns, they are considered to be the before
+and after states of the same row. You can either set identifier fields in the table schema or input them as the procedure parameters.
+
+The following example shows pre/post update images computation with an identifier column (`id`), where a row deletion
+and an insertion with the same `id` are treated as a single update operation. Specifically, suppose we have the following pair of rows:
+
+| id  | name   | _change_type |
+|-----|--------|--------------|
+| 3   | Robert | DELETE       |
+| 3   | Dan    | INSERT       |
+
+In this case, the procedure marks the row before the update as an `UPDATE_BEFORE` image and the row after the update
+as an `UPDATE_AFTER` image, resulting in the following pre/post update images:
+
+| id  | name   | _change_type |
+|-----|--------|--------------|
+| 3   | Robert | UPDATE_BEFORE|
+| 3   | Dan    | UPDATE_AFTER |
+
+#### Usage

Review Comment:
   I think we should make the basic usage the first subsection under
`create_changelog_view`, followed by dedicated subsections that explain what a
carry-over row is, what pre/post update images are, and how to configure each,
with some examples. Wouldn't that be a more straightforward tutorial structure?
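
To make the two behaviours in the quoted doc concrete, here is a minimal Python sketch of their semantics on in-memory rows. It is not Iceberg's implementation and does not call the procedure; the function names `remove_carryovers` and `compute_update_images`, the tuple layout `(id, name, _change_type)`, and the `id_columns` parameter are assumptions made for illustration only.

```python
from collections import Counter

DELETE, INSERT = "DELETE", "INSERT"


def remove_carryovers(rows):
    """Drop DELETE/INSERT pairs whose data columns are identical.

    Each row is a tuple whose last element is the change type.
    (Hypothetical helper; mirrors the doc's description only.)
    """
    deletes = Counter(row[:-1] for row in rows if row[-1] == DELETE)
    inserts = Counter(row[:-1] for row in rows if row[-1] == INSERT)
    # For each distinct data tuple, this many DELETEs and this many
    # INSERTs cancel each other out as carry-over pairs.
    to_drop = {
        (data, kind): min(count, inserts[data])
        for data, count in deletes.items()
        for kind in (DELETE, INSERT)
    }
    kept = []
    for row in rows:
        key = (row[:-1], row[-1])
        if to_drop.get(key, 0) > 0:
            to_drop[key] -= 1  # this row is one half of a carry-over pair
        else:
            kept.append(row)
    return kept


def compute_update_images(rows, id_columns=(0,)):
    """Relabel a DELETE/INSERT pair sharing identifier values as
    UPDATE_BEFORE/UPDATE_AFTER. `id_columns` holds tuple positions
    (an assumption of this sketch, not the procedure's parameter).
    """
    groups = {}
    for index, row in enumerate(rows):
        key = tuple(row[c] for c in id_columns)
        groups.setdefault(key, []).append(index)

    result = list(rows)
    for indexes in groups.values():
        kinds = sorted(rows[i][-1] for i in indexes)
        if kinds == [DELETE, INSERT]:
            for i in indexes:
                image = "UPDATE_BEFORE" if rows[i][-1] == DELETE else "UPDATE_AFTER"
                result[i] = rows[i][:-1] + (image,)
    return result


# Rows taken from the two tables in the quoted doc, plus the actual
# delete of row2 (Bob) from the copy-on-write example.
changelog = [
    (1, "Alice", DELETE),
    (1, "Alice", INSERT),
    (2, "Bob", DELETE),
    (3, "Robert", DELETE),
    (3, "Dan", INSERT),
]

without_carryovers = remove_carryovers(changelog)
with_update_images = compute_update_images(without_carryovers)
```

With these inputs, the Alice pair disappears, the Bob delete stays a `DELETE`, and the `id=3` pair becomes `UPDATE_BEFORE` (Robert) and `UPDATE_AFTER` (Dan).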



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

