Andreas Neumann created SPARK-56249:
---------------------------------------
Summary: Auto CDC support
Key: SPARK-56249
URL: https://issues.apache.org/jira/browse/SPARK-56249
Project: Spark
Issue Type: Umbrella
Components: Declarative Pipelines, PySpark, SQL
Affects Versions: 4.2.0
Reporter: Andreas Neumann
*SPIP Document:* [SPIP: Auto
CDC|https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/]
*Motivation*
With the upcoming introduction of standardized CDC support, Spark will soon
have a unified way to _produce_ change data feeds. However, _consuming_ these
feeds and applying them to a target table remains a significant challenge.
Common patterns like *SCD Type 1* (maintaining a 1:1 replica) and *SCD Type 2*
(tracking full change history) often require hand-crafted, complex MERGE logic.
In distributed systems, these implementations are frequently error-prone when
handling deletions or out-of-order data.
*Proposal*
This SPIP proposes a new *"Auto CDC"* flow type for Spark. It encapsulates the
complex logic for SCD types and out-of-order data, allowing data engineers to
configure a declarative flow instead of writing manual MERGE statements. This
feature will be available in both {*}Python and SQL{*}.
Example SQL:
{code:java}
-- Produce a change feed
CREATE STREAMING TABLE cdc.users AS
SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
-- Consume the change feed
CREATE FLOW flow
AS AUTO CDC INTO
target
FROM stream(cdc_data.users)
KEYS (userId)
APPLY AS DELETE WHEN operation = "DELETE"
SEQUENCE BY sequenceNum
COLUMNS * EXCEPT (operation, sequenceNum)
STORED AS SCD TYPE 2
TRACK HISTORY ON * EXCEPT (city);
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]