Hi Spark devs, I'd like to call a vote on the SPIP*: Auto CDC Support for Apache Spark* Motivation
With the upcoming introduction of standardized CDC support <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon have a unified way to produce change data feeds. However, consuming these feeds and applying them to a target table remains a significant challenge. Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD Type 2 (tracking full change history) often require hand-crafted, complex MERGE logic. In distributed systems, these implementations are frequently error-prone when handling deletions or out-of-order data. Proposal This SPIP proposes a new "Auto CDC" flow type for Spark. It encapsulates the complex logic for SCD types and out-of-order data, allowing data engineers to configure a declarative flow instead of writing manual MERGE statements. This feature will be available in both Python and SQL. Example SQL: -- Produce a change feed CREATE STREAMING TABLE cdc.users AS SELECT * FROM STREAM my_table CHANGES FROM VERSION 10; -- Consume the change feed CREATE FLOW flow AS AUTO CDC INTO target FROM stream(cdc_data.users) KEYS (userId) APPLY AS DELETE WHEN operation = "DELETE" SEQUENCE BY sequenceNum COLUMNS * EXCEPT (operation, sequenceNum) STORED AS SCD TYPE 2 TRACK HISTORY ON * EXCEPT (city); *Relevant Links:* - SPIP Document: https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/ - *Discussion Thread: * https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7 - JIRA: <https://issues.apache.org/jira/browse/SPARK-55668> https://issues.apache.org/jira/browse/SPARK-56249 *The vote will be open for at least 72 hours. *Please vote: [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don't think this is a good idea because ... Cheers -Andreas
