Hi Spark devs,

I'd like to call a vote on the SPIP: *Auto CDC Support for Apache Spark*

Motivation

With the upcoming introduction of standardized CDC support
<https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon have a
unified way to produce change data feeds. However, consuming these feeds
and applying them to a target table remains a significant challenge.

Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD Type 2
(tracking full change history) often require hand-crafted, complex MERGE
logic. In distributed systems, these implementations are frequently
error-prone when handling deletions or out-of-order data.
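
To make the failure mode concrete, here is a minimal sketch in plain Python
(not Spark code, and not taken from the SPIP) of what a hand-written SCD
Type 1 apply step has to get right. The names Change, Scd1Target and
last_seq are hypothetical and exist only for this illustration. The sketch
assumes every event carries a key, a monotonically meaningful sequence
number, and an operation.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Change:
    key: str                              # primary key of the row
    seq: int                              # ordering column; higher means newer
    op: str                               # "UPSERT" or "DELETE"
    row: Optional[Dict[str, Any]] = None  # payload for upserts


class Scd1Target:
    """Toy SCD Type 1 table: at most one current row per key."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, Any]] = {}
        # Highest sequence number applied per key. It is kept even after a
        # delete (acting as a tombstone) so that a late, older upsert cannot
        # resurrect a row that was already deleted.
        self.last_seq: Dict[str, int] = {}

    def apply(self, change: Change) -> None:
        seen = self.last_seq.get(change.key)
        if seen is not None and change.seq <= seen:
            return  # stale or duplicate event: ignore it
        self.last_seq[change.key] = change.seq
        if change.op == "DELETE":
            self.rows.pop(change.key, None)
        else:
            self.rows[change.key] = dict(change.row or {})


target = Scd1Target()
events = [
    Change("u1", 3, "DELETE"),
    Change("u1", 1, "UPSERT", {"name": "Ann", "city": "Oslo"}),   # arrives late
    Change("u2", 2, "UPSERT", {"name": "Bob", "city": "Rome"}),
    Change("u2", 1, "UPSERT", {"name": "Bob", "city": "Paris"}),  # arrives late
]
for event in events:
    target.apply(event)

print(target.rows)
```

Without the per-key sequence check and the tombstone, the late upsert for
u1 would bring back a deleted row and the late upsert for u2 would overwrite
newer data. This is the kind of bookkeeping that is easy to miss in a manual
MERGE.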

Proposal

This SPIP proposes a new "Auto CDC" flow type for Spark. It encapsulates
the complex logic for SCD types and out-of-order data, allowing data
engineers to configure a declarative flow instead of writing manual
MERGE statements.
This feature will be available in both Python and SQL.

Example SQL:

-- Produce a change feed
CREATE STREAMING TABLE cdc.users AS
SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;

-- Consume the change feed
CREATE FLOW flow
AS AUTO CDC INTO
  target
FROM stream(cdc_data.users)
  KEYS (userId)
  APPLY AS DELETE WHEN operation = "DELETE"
  SEQUENCE BY sequenceNum
  COLUMNS * EXCEPT (operation, sequenceNum)
  STORED AS SCD TYPE 2
  TRACK HISTORY ON * EXCEPT (city);
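
For readers less familiar with SCD Type 2, the following plain-Python sketch
illustrates one plausible reading of the clauses in the example above. It is
not the proposed implementation and not Spark code. The function build_scd2
and the start_at / end_at column names are hypothetical, and the handling of
TRACK HISTORY ON * EXCEPT (city) (updating the untracked column in place
instead of opening a new history row) is an assumption; the SPIP document is
the authority on the exact semantics.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Sequence


def build_scd2(
    events: Sequence[Dict[str, Any]],
    key: str = "userId",            # KEYS (userId)
    seq: str = "sequenceNum",       # SEQUENCE BY sequenceNum
    op: str = "operation",          # APPLY AS DELETE WHEN operation = "DELETE"
    untracked: Sequence[str] = ("city",),  # TRACK HISTORY ON * EXCEPT (city)
) -> List[Dict[str, Any]]:
    """Build SCD Type 2 history rows from an unordered batch of change events."""
    by_key: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for event in events:
        by_key[event[key]].append(event)

    history: List[Dict[str, Any]] = []
    for changes in by_key.values():
        open_row: Optional[Dict[str, Any]] = None
        # Sorting by the sequence column neutralizes out-of-order arrival.
        for event in sorted(changes, key=lambda c: c[seq]):
            if event[op] == "DELETE":
                if open_row is not None:
                    open_row["end_at"] = event[seq]  # close the current version
                    open_row = None
                continue
            # COLUMNS * EXCEPT (operation, sequenceNum)
            data = {c: v for c, v in event.items() if c not in (op, seq)}
            if open_row is not None:
                tracked_same = all(
                    open_row.get(c) == v
                    for c, v in data.items()
                    if c not in untracked
                )
                if tracked_same:
                    # Only untracked columns changed: update in place.
                    for c in untracked:
                        if c in data:
                            open_row[c] = data[c]
                    continue
                open_row["end_at"] = event[seq]
            open_row = {**data, "start_at": event[seq], "end_at": None}
            history.append(open_row)
    return history


events = [
    {"userId": 1, "name": "Ann", "city": "Oslo", "operation": "INSERT", "sequenceNum": 1},
    {"userId": 1, "name": "Anna", "city": "Lima", "operation": "UPDATE", "sequenceNum": 3},
    {"userId": 1, "name": "Anna", "city": "Rome", "operation": "UPDATE", "sequenceNum": 2},
    {"userId": 2, "name": "Bob", "city": "Paris", "operation": "INSERT", "sequenceNum": 1},
    {"userId": 2, "operation": "DELETE", "sequenceNum": 4},
]
history = build_scd2(events)
for row in history:
    print(row)
```

In this sketch the name change at sequence 2 opens a new history row, the
city-only change at sequence 3 is folded into that row, and the delete at
sequence 4 closes the open row for user 2.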


*Relevant Links:*

   - SPIP Document:
     https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
   - Discussion Thread:
     https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7
   - JIRA:
     https://issues.apache.org/jira/browse/SPARK-55668
     https://issues.apache.org/jira/browse/SPARK-56249

*The vote will be open for at least 72 hours.* Please vote:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Cheers,
Andreas
