nihal111 commented on issue #12558:
URL: https://github.com/apache/iceberg/issues/12558#issuecomment-3551162202
I ended up bypassing Iceberg’s match-action semantics entirely.
Instead of relying on `WHEN MATCHED AND …` inside the MERGE (which can still
rewrite rows even when the condition is false), I compute my own CDC labels
**before** the MERGE with a FULL OUTER JOIN.
The pattern looks like this:
----------
### **1. Compute CDC using a FULL OUTER JOIN**
You join the incoming data with the target table on the primary keys and
mark each row as Insert / Update / Delete / Retain.
```SQL
WITH filtered_target AS (
SELECT *
FROM target_table
),
cdc_marked AS (
SELECT
/* coalesced primary keys */
COALESCE(target.pk1, source.pk1) AS pk1,
COALESCE(target.pk2, source.pk2) AS pk2,
...
/* carry source columns forward */
source.col1,
source.col2,
...
/* CDC classifier */
CASE
WHEN source.pk1 IS NULL AND source.pk2 IS NULL
THEN 'D' -- exists only in target
WHEN target.pk1 IS NULL AND target.pk2 IS NULL
THEN 'I' -- exists only in source
WHEN <change_condition>
THEN 'U' -- exists in both, and values differ
ELSE 'R' -- unchanged
END AS cdc
FROM filtered_target AS target
FULL OUTER JOIN source_view AS source
ON target.pk1 IS NOT DISTINCT FROM source.pk1
AND target.pk2 IS NOT DISTINCT FROM source.pk2
)
SELECT * FROM cdc_marked;
```
This generates exactly the rows you intend to act on, with an explicit CDC
label:
- `'I'` insert
- `'U'` update
- `'D'` delete
- `'R'` retain / unchanged (these should **not** touch the table)
----------
### **2. Drive MERGE logic using the CDC label**
I then filter the changes to remove the 'R' records. Now the MERGE becomes
deterministic and avoids Iceberg’s implicit “emit all” behavior.
```SQL
MERGE INTO target_table AS target
USING cdc_marked AS changes
ON target.pk1 IS NOT DISTINCT FROM changes.pk1
AND target.pk2 IS NOT DISTINCT FROM changes.pk2
WHEN MATCHED AND changes.cdc = 'U' THEN
UPDATE SET
col1 = changes.col1,
col2 = changes.col2,
...
WHEN MATCHED AND changes.cdc = 'D' THEN
DELETE
WHEN NOT MATCHED AND changes.cdc = 'I' THEN
INSERT (pk1, pk2, col1, col2, ...)
VALUES (changes.pk1, changes.pk2, changes.col1, changes.col2, ...);
```
This effectively sidesteps the unintended rewrites described in the issue.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]