Hi Mich,

I completely agree that changelogs are an important feature. However, the objective of this SPIP is to ease the processing of a changelog in order to produce a replica of the original source data. Exposing the changelog on the source side is the subject of this earlier SPIP: Change Data Capture (CDC) Support <https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?tab=t.0>.
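To make "processing a changelog to produce a replica" concrete, here is a rough sketch in plain Python of the Type 1 replay semantics involved. All names here (`apply_changelog`, the event fields) are made up for illustration and are not part of the SPIP's proposed API:

```python
def apply_changelog(events):
    """Replay a batch of change events into a keyed replica (SCD Type 1).

    Each event is a dict with 'key', 'op' ('UPSERT' or 'DELETE'),
    'seq' (the sequencing value), and 'row' (the payload).
    Hypothetical shape, chosen only for this sketch.
    """
    # NULL sequencing values fail fast, mirroring the SEQUENCE BY
    # constraint discussed later in this thread.
    for ev in events:
        if ev["seq"] is None:
            raise ValueError("NULL sequencing values are not supported")

    replica = {}  # key -> latest row
    for ev in sorted(events, key=lambda e: e["seq"]):
        if ev["op"] == "DELETE":
            # No tombstone is kept here; in a streaming setting a late
            # event older than the delete could resurrect the key -- the
            # tombstone discussion in this thread is about that problem.
            replica.pop(ev["key"], None)
        else:
            replica[ev["key"]] = ev["row"]
    return replica
```

The real feature has to do this incrementally and fault-tolerantly across micro-batches; this only illustrates the ordering and delete semantics.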
Cheers -Andreas

On Wed, Apr 1, 2026 at 4:53 AM Mich Talebzadeh <[email protected]> wrote:

> I would suggest framing Auto CDC not only around MERGE/mutation, but also around *immutability and auditability*.
>
> In many real-world cases, the ability to reconstruct what happened (ordering of changes, intermediate states) is as important as computing the latest state. From this perspective, Auto CDC should treat immutable changelog data as a first-class abstraction, queryable via Spark SQL across storage backends. This enables:
>
> - state reconstruction (point-in-time views)
> - audit / forensic analysis
> - validation of upstream processes
>
> MERGE then becomes just one materialisation strategy, rather than the core abstraction.
>
> From a connector standpoint, exposing complete, ordered changelogs is more fundamental than supporting MERGE alone.
>
> HTH
>
> Dr Mich Talebzadeh,
> Data Scientist | Distributed Systems (Spark) | Financial Forensics & Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based Analytics
>
> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Tue, 31 Mar 2026 at 18:09, Anton Okolnychyi <[email protected]> wrote:
>
>> I think this is going to be a great feature if it naturally extends the current capabilities of Spark and complements / builds on top of changelogs and MERGE.
>>
>> I would be really worried otherwise.
>>
>> On Tue, Mar 31, 2026 at 6:02 AM Andreas Neumann <[email protected]> wrote:
>>
>>> Hi Mich,
>>>
>>> I agree that it is, in theory, possible to implement this for connectors that do not support MERGE. But it is our intention to rely on MERGE support for this feature. If there is significant interest, we could follow up with an implementation that does not require MERGE, but I would want to evaluate first whether the additional complexity is justified.
>>>
>>> Note also that we will not require changelogs.
>>> That is a requirement for the source that produces the CDC feed, as addressed by this existing SPIP: Change Data Capture (CDC) Support <https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing>. I actually expect that many use cases will come from sources that do not support MERGE but do support change feeds.
>>>
>>> Cheers -Andreas
>>>
>>> On Mon, Mar 30, 2026 at 2:09 PM Mich Talebzadeh <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Auto CDC may compile to a MERGE-like row-level plan, but the SPIP should describe that as one implementation strategy, *not necessarily the only one*. Connectors do not just need changelogs and MERGE; they need changelog semantics on the read side and row-level mutation capability on the write side, plus keys and usually sequencing.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> On Mon, 30 Mar 2026 at 16:49, Anton Okolnychyi <[email protected]> wrote:
>>>>
>>>>> Will auto CDC compile into or use MERGE under the hood? If yes, can we include a sketch of what that rewrite will look like in the SPIP? What exactly do connectors need to support to benefit from auto CDC? Just support changelogs and MERGE?
>>>>>
>>>>> - Anton
>>>>>
>>>>> On Mon, Mar 30, 2026 at 07:44 Andreas Neumann <[email protected]> wrote:
>>>>>
>>>>>> Hi Vaquar,
>>>>>>
>>>>>> I responded to most of your comments on the document itself. Additional comments inline.
>>>>>>
>>>>>> Cheers -Andreas
>>>>>>
>>>>>> On Sat, Mar 28, 2026 at 4:42 PM vaquar khan <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Thanks for the SPIP.
>>>>>>> I fully support the goal: abstracting CDC merge logic is a huge win for the community. However, looking at the current Spark versions, there are significant architectural gaps between Databricks Lakeflow's proprietary implementation and OSS Spark.
>>>>>>>
>>>>>>> A few technical blockers need clarification before we move forward:
>>>>>>>
>>>>>>> - OSS Compatibility: Databricks documentation explicitly states that the AUTO CDC APIs are not supported by Apache Spark Declarative Pipelines <https://docs.databricks.com/gcp/en/ldp/cdc>.
>>>>>>
>>>>>> That will change with the implementation of this SPIP.
>>>>>>
>>>>>>> - Streaming MERGE: The proposed flow requires continuous upsert/delete semantics, but Dataset.mergeInto() currently does not support streaming queries. Does this SPIP introduce an entirely new execution path to bypass this restriction?
>>>>>>
>>>>>> This works with foreachBatch.
>>>>>>
>>>>>>> - Tombstone Garbage Collection: Handling stream deletes safely requires state store tombstone retention (e.g., configuring pipelines.cdc.tombstoneGCThresholdInSeconds) to prevent late-arriving data from resurrecting deleted keys. How will this be implemented natively in OSS Spark state stores?
>>>>>>
>>>>>> That's an interesting question. Tombstones could be modeled in the state store, but we are thinking that they will be modeled as an explicit output of the flow: either as records in the output table with a "deleted at" marker, possibly with a view on top to project away these rows, or in a separate output that contains only the tombstones. The exact design is not finalized; that is part of the first phase of the project.
>>>>>>
>>>>>>> - Sequencing Constraints: SEQUENCE BY enforces strict ordering where NULL sequencing values are explicitly not supported.
>>>>>>> How will the engine handle malformed or non-monotonic upstream sequences compared to our existing time-based watermarks?
>>>>>>
>>>>>> I think malformed change events should, at least in the first iteration, fail the stream. Otherwise there is a risk of writing incorrect data.
>>>>>>
>>>>>>> - Given the massive surface area (new SQL DDL, streaming MERGE paths, SCD Type 1/2 state logic, tombstone GC), a phased delivery plan would be very helpful. It would also clarify exactly which Lakeflow components are being contributed to open source versus what needs to be rebuilt from scratch.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Vaquar Khan
>>>>>>>
>>>>>>> On Sat, 28 Mar 2026 at 08:35, 陈 小健 <[email protected]> wrote:
>>>>>>>
>>>>>>>> unsubscribe
>>>>>>>>
>>>>>>>> From: Andreas Neumann <[email protected]>
>>>>>>>> Sent: Saturday, March 28, 2026 2:43:54 AM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Re: SPIP: Auto CDC support for Apache Spark
>>>>>>>>
>>>>>>>> Hi Vaibhav,
>>>>>>>>
>>>>>>>> The goal of this proposal is not to replace MERGE but to provide a simple abstraction for the common use case of CDC. MERGE itself is a very powerful operator, and there will always be use cases outside of CDC that require MERGE.
>>>>>>>>
>>>>>>>> And thanks for spotting the typo in the SPIP. It is fixed now!
>>>>>>>>
>>>>>>>> Cheers -Andreas
>>>>>>>>
>>>>>>>> On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Andreas,
>>>>>>>>
>>>>>>>> Thanks for sharing the SPIP. Does that mean the MERGE statement would be deprecated? Also, I think there was a small typo, which I have suggested in the doc.
>>>>>>>> Regards,
>>>>>>>> Vaibhav
>>>>>>>>
>>>>>>>> On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote:
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>>>>>
>>>>>>>> On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I’d like to start a discussion on a new SPIP to introduce Auto CDC support to Apache Spark.
>>>>>>>>
>>>>>>>> - SPIP Document: https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>>>>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>
>>>>>>>> Motivation
>>>>>>>>
>>>>>>>> With the upcoming introduction of standardized CDC support <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon have a unified way to produce change data feeds. However, consuming these feeds and applying them to a target table remains a significant challenge.
>>>>>>>>
>>>>>>>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD Type 2 (tracking full change history) often require hand-crafted, complex MERGE logic. In distributed systems, these implementations are frequently error-prone when handling deletions or out-of-order data.
>>>>>>>>
>>>>>>>> Proposal
>>>>>>>>
>>>>>>>> This SPIP proposes a new "Auto CDC" flow type for Spark. It encapsulates the complex logic for SCD types and out-of-order data, allowing data engineers to configure a declarative flow instead of writing manual MERGE statements. This feature will be available in both Python and SQL.
>>>>>>>> Example SQL:
>>>>>>>>
>>>>>>>> -- Produce a change feed
>>>>>>>> CREATE STREAMING TABLE cdc.users AS
>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>>>>>>>
>>>>>>>> -- Consume the change feed
>>>>>>>> CREATE FLOW flow
>>>>>>>> AS AUTO CDC INTO target
>>>>>>>> FROM stream(cdc_data.users)
>>>>>>>> KEYS (userId)
>>>>>>>> APPLY AS DELETE WHEN operation = "DELETE"
>>>>>>>> SEQUENCE BY sequenceNum
>>>>>>>> COLUMNS * EXCEPT (operation, sequenceNum)
>>>>>>>> STORED AS SCD TYPE 2
>>>>>>>> TRACK HISTORY ON * EXCEPT (city);
>>>>>>>>
>>>>>>>> Please review the full SPIP for the technical details. Looking forward to your feedback and discussion!
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Andreas
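For readers less familiar with the SCD patterns discussed in this thread, the effect of a `STORED AS SCD TYPE 2` flow can be approximated in plain Python: instead of overwriting a key's row, each change closes the current version and appends a new one. All names below are hypothetical illustrations, not the SPIP's proposed API:

```python
def apply_scd2(events):
    """Sketch of SCD Type 2 semantics over ordered change events.

    events: dicts with 'key', 'op' ('UPSERT' or 'DELETE'), 'seq', 'row'.
    Returns history rows: dicts with 'key', 'row', 'valid_from', and
    'valid_to' (None means the version is still current).
    """
    history = []
    open_version = {}  # key -> index of the currently open row in history
    for ev in sorted(events, key=lambda e: e["seq"]):
        idx = open_version.pop(ev["key"], None)
        if idx is not None:
            # Close the previously current version at this sequence point.
            history[idx]["valid_to"] = ev["seq"]
        if ev["op"] != "DELETE":
            # Open a new current version; a DELETE only closes, never opens.
            history.append({"key": ev["key"], "row": ev["row"],
                            "valid_from": ev["seq"], "valid_to": None})
            open_version[ev["key"]] = len(history) - 1
    return history
```

The declarative flow in the example above would maintain exactly this kind of versioned table (plus clauses like `TRACK HISTORY ON` to limit which columns open a new version), without the user writing the bookkeeping by hand.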