moomindani opened a new pull request, #15594:
URL: https://github.com/apache/iceberg/pull/15594
## Summary
Closes #15593
Introduces a new `scd_type2 => boolean` parameter to the
`create_changelog_view` Spark procedure that natively produces a Slowly
Changing Dimensions Type-2 (SCD Type-2) history view from an Iceberg changelog
table, without requiring users to write custom SQL.
### New columns added to the view
| Column | Type | Nullable | Description |
|---|---|---|---|
| `_valid_from` | TIMESTAMP | no | Commit timestamp when this version became
active |
| `_valid_to` | TIMESTAMP | yes | Commit timestamp when superseded; `NULL`
if still active |
| `_is_current` | BOOLEAN | no | `true` when `_valid_to IS NULL AND
_change_type != 'DELETE'` |
### Example
```sql
CALL catalog.system.create_changelog_view(
table => 'db.products',
changelog_view => 'products_history',
identifier_columns => array('product_id'),
scd_type2 => true
)
```
Given inserts and updates to `products`, the resulting view:
| product_id | name | _change_type | _valid_from | _valid_to | _is_current |
|---|---|---|---|---|---|
| 1 | Widget | INSERT | 2024-01-01 | 2024-02-01 | false |
| 1 | Widget Pro | UPDATE_AFTER | 2024-02-01 | NULL | true |
### Design choices
- `NULL` sentinel for open-ended rows (not `9999-12-31`) — aligns with dbt
and Databricks DLT
- `_valid_from`/`_valid_to` naming — aligns with dbt
`dbt_valid_from`/`dbt_valid_to` and Kimball semantics
- `UPDATE_BEFORE` rows filtered out — intermediate artifacts not meaningful
in an SCD Type-2 view
- DELETE rows kept with `_is_current = false` — tracks hard deletes
- `_is_current` convenience flag added — avoids the two-condition footgun
(`_valid_to IS NULL AND _change_type != 'DELETE'`)
### Constraints
- Requires `identifier_columns` (explicit or from table schema)
- Incompatible with `net_changes = true`
- Implicitly forces `compute_updates = true`
## Changes
- `CreateChangelogViewProcedure.java` — Spark v3.4, v3.5, v4.0, v4.1
- `TestCreateChangelogViewProcedure.java` — Spark v3.4, v3.5, v4.0, v4.1
## Test plan
- [x] `testScdType2BasicInsertUpdateDelete` — INSERT → UPDATE → DELETE
lifecycle, verifies all 3 new columns
- [x] `testScdType2CurrentRows` — only non-deleted, most-recent rows have
`_is_current = true`
- [x] `testScdType2HardDelete` — DELETE row has `_valid_to = NULL`,
`_is_current = false`
- [x] `testScdType2RequiresIdentifierColumns` — error when no identifier
columns
- [x] `testScdType2IncompatibleWithNetChanges` — error when combined with
`net_changes = true`
- [x] `testScdType2OutputSchema` — verifies all 8 output columns in correct
order
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]