Re: [VOTE] SPIP: Auto CDC Support for Apache Spark

vaquar khan Sat, 04 Apr 2026 09:45:16 -0700

+1

Regards,
Viquar Khan


On Sat, 4 Apr 2026 at 11:14, Lisa N. Cao <[email protected]> wrote:

> +1 (non-binding)
>
> --
> LNC
>
> On Fri, Apr 3, 2026, 5:15 PM Shixiong Zhu <[email protected]> wrote:
>
>> +1
>>
>>
>> On Fri, Apr 3, 2026 at 5:03 PM Mich Talebzadeh <[email protected]>
>> wrote:
>>
>>> +1
>>>
>>> Dr Mich Talebzadeh,
>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>> Analytics
>>>
>>>    view my LinkedIn profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> On Fri, 3 Apr 2026 at 23:00, Andreas Neumann <[email protected]> wrote:
>>>
>>>> Hi Spark devs,
>>>>
>>>> I'd like to call a vote on the SPIP: *Auto CDC Support for Apache Spark*
>>>>
>>>> Motivation
>>>>
>>>> With the upcoming introduction of standardized CDC support
>>>> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon
>>>> have a unified way to produce change data feeds. However, consuming these
>>>> feeds and applying them to a target table remains a significant challenge.
>>>>
>>>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD
>>>> Type 2 (tracking full change history) often require hand-crafted,
>>>> complex MERGE logic. In distributed systems, these implementations are
>>>> frequently error-prone when handling deletions or out-of-order data.
>>>>
>>>> Proposal
>>>>
>>>> This SPIP proposes a new "Auto CDC" flow type for Spark. It
>>>> encapsulates the complex logic for SCD types and out-of-order data,
>>>> allowing data engineers to configure a declarative flow instead of writing
>>>> manual MERGE statements. This feature will be available in both Python
>>>> and SQL.
>>>>
>>>> Example SQL:
>>>>
>>>> -- Produce a change feed
>>>> CREATE STREAMING TABLE cdc.users AS
>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>>>
>>>> -- Consume the change feed
>>>> CREATE FLOW flow
>>>> AS AUTO CDC INTO
>>>>   target
>>>> FROM stream(cdc_data.users)
>>>>   KEYS (userId)
>>>>   APPLY AS DELETE WHEN operation = "DELETE"
>>>>   SEQUENCE BY sequenceNum
>>>>   COLUMNS * EXCEPT (operation, sequenceNum)
>>>>   STORED AS SCD TYPE 2
>>>>   TRACK HISTORY ON * EXCEPT (city);
>>>>
>>>>
>>>> *Relevant Links:*
>>>>
>>>>    - SPIP Document:
>>>>      https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>>>>    - Discussion Thread:
>>>>      https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7
>>>>    - JIRA: <https://issues.apache.org/jira/browse/SPARK-55668>
>>>>      https://issues.apache.org/jira/browse/SPARK-56249
>>>>
>>>> *The vote will be open for at least 72 hours.* Please vote:
>>>>
>>>> [ ] +1: Accept the proposal as an official SPIP
>>>> [ ] +0
>>>> [ ] -1: I don't think this is a good idea because ...
>>>>
>>>> Cheers -Andreas
>>>>
>>>>
>>>>
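
For readers unfamiliar with the pattern, the SCD Type 1 behaviour described in the proposal (a 1:1 replica that tolerates deletions and out-of-order events) can be illustrated with a small in-memory sketch. This is not the SPIP's implementation and does not use any Spark API; `ChangeEvent` and `apply_scd_type1` are hypothetical names chosen for illustration, and the field names only mirror the `KEYS`, `SEQUENCE BY`, and `APPLY AS DELETE WHEN` clauses of the SQL example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class ChangeEvent:
    """Hypothetical change record; not a Spark type."""
    key: Any                               # corresponds to KEYS (userId)
    sequence_num: int                      # corresponds to SEQUENCE BY sequenceNum
    operation: str                         # "INSERT", "UPDATE", or "DELETE"
    row: Optional[Dict[str, Any]] = None   # payload columns; None for deletes


def apply_scd_type1(events: Iterable[ChangeEvent]) -> Dict[Any, Dict[str, Any]]:
    """Fold change events into a keyed replica, keeping the latest state per key."""
    # Highest sequence number applied per key. Deletes are recorded too, so a
    # late-arriving older insert cannot resurrect a deleted row.
    latest_seq: Dict[Any, int] = {}
    target: Dict[Any, Dict[str, Any]] = {}

    for ev in events:
        # Ignore events that are not newer than what was already applied.
        if ev.key in latest_seq and ev.sequence_num <= latest_seq[ev.key]:
            continue
        latest_seq[ev.key] = ev.sequence_num
        if ev.operation == "DELETE":
            target.pop(ev.key, None)
        else:
            target[ev.key] = dict(ev.row or {})
    return target


events = [
    ChangeEvent(1, 3, "UPDATE", {"userId": 1, "city": "Berlin"}),
    ChangeEvent(1, 1, "INSERT", {"userId": 1, "city": "Paris"}),  # late, ignored
    ChangeEvent(2, 2, "DELETE"),
    ChangeEvent(2, 1, "INSERT", {"userId": 2, "city": "Oslo"}),   # older than its delete
]
result = apply_scd_type1(events)
print(result)
```

Tracking the sequence number of deletes separately from the target rows is the part that hand-written MERGE logic tends to get wrong; an SCD Type 2 variant would additionally keep closed-out history rows rather than overwriting them.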
