Andreas Neumann created SPARK-56249:
---------------------------------------

             Summary: Auto CDC support
                 Key: SPARK-56249
                 URL: https://issues.apache.org/jira/browse/SPARK-56249
             Project: Spark
          Issue Type: Umbrella
          Components: Declarative Pipelines, PySpark, SQL
    Affects Versions: 4.2.0
            Reporter: Andreas Neumann


*SPIP Document:* [SPIP: Auto 
CDC|https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/]

*Motivation*

With the upcoming introduction of standardized CDC support, Spark will soon 
have a unified way to _produce_ change data feeds. However, _consuming_ these 
feeds and applying them to a target table remains a significant challenge.

Common patterns like *SCD Type 1* (maintaining a 1:1 replica) and *SCD Type 2* 
(tracking full change history) often require hand-crafted, complex MERGE logic. 
In distributed systems, these implementations are frequently error-prone when 
handling deletions or out-of-order data.

*Proposal*

This SPIP proposes a new *"Auto CDC"* flow type for Spark. It encapsulates the 
complex logic for SCD types and out-of-order data, allowing data engineers to 
configure a declarative flow instead of writing manual MERGE statements. This 
feature will be available in both {*}Python and SQL{*}.

Example SQL:
{code:java}
-- Produce a change feed
CREATE STREAMING TABLE cdc.users AS
SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;

-- Consume the change feed
CREATE FLOW flow
AS AUTO CDC INTO
  target
FROM stream(cdc_data.users)
  KEYS (userId)
  APPLY AS DELETE WHEN operation = "DELETE"
  SEQUENCE BY sequenceNum
  COLUMNS * EXCEPT (operation, sequenceNum)
  STORED AS SCD TYPE 2
  TRACK HISTORY ON * EXCEPT (city);
  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to