hongthang152 opened a new issue, #16342:
URL: https://github.com/apache/iceberg/issues/16342

   ### Query engine
   
   Spark EMR
   
   ### Question
   
   Hi folks,
   
   We're running Spark 3.5 + Iceberg 1.6 in a large-scale data pipeline that 
performs frequent `MERGE INTO` operations on Iceberg tables. We need to produce 
a change data capture (CDC) feed: for every merge job, we want to know which 
rows were inserted, updated, or deleted, so that downstream consumers can 
process only the delta.
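
To make the desired output concrete, here is a minimal in-memory sketch of the per-merge delta we have in mind. It is an illustration only: the function name `diff_by_key`, the `id` key column, and the sample rows are hypothetical, and this is neither how Iceberg computes changes nor how our patched writer works.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def diff_by_key(before: List[Row], after: List[Row], key: str) -> List[Tuple[str, Row]]:
    """Classify rows as INSERT, UPDATE, or DELETE by comparing two table states.

    `before` and `after` stand in for the table contents on either side of
    one merge job; `key` is the (assumed unique) identifier column.
    """
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}

    changes: List[Tuple[str, Row]] = []
    for k, row in new.items():
        if k not in old:
            changes.append(("INSERT", row))
        elif row != old[k]:
            changes.append(("UPDATE", row))
    for k, row in old.items():
        if k not in new:
            changes.append(("DELETE", row))
    return changes


if __name__ == "__main__":
    before = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
    after = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 4, "v": "d"}]
    for change_type, row in diff_by_key(before, after, "id"):
        print(change_type, row)
```

Downstream consumers would read only these change records rather than rescanning the table.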
   
   ## Our constraints
   
   1. **We use Merge-on-Read (MoR)** for write performance reasons. With 
Copy-on-Write, our Spark plan uses a full outer join, which is expensive and 
frequently causes Spark jobs to time out at our data volumes.
   2. **`create_changelog_view` does not support MoR tables.** It only works 
with Copy-on-Write today.
   3. **Switching to CoW just to get `create_changelog_view`** is not viable — 
beyond the full outer join timeout issue, the write amplification makes it 
impractical for our merge-heavy workload.
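
For reference, the two pieces involved in these constraints can be sketched as the SQL statements below, built from small Python helpers so they are easy to template. The `write.merge.mode` table property and the `create_changelog_view` procedure arguments (`table`, `options`, `identifier_columns`) are the documented Iceberg names; the catalog `my_catalog`, table `db.events`, snapshot IDs, and helper function names are placeholders, not our real setup.

```python
from typing import Sequence

VALID_MODES = ("copy-on-write", "merge-on-read")


def set_merge_mode_sql(table: str, mode: str) -> str:
    """Build the ALTER TABLE statement that selects the MERGE write mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return f"ALTER TABLE {table} SET TBLPROPERTIES ('write.merge.mode' = '{mode}')"


def changelog_view_sql(
    catalog: str,
    table: str,
    start_snapshot_id: int,
    end_snapshot_id: int,
    identifier_columns: Sequence[str],
) -> str:
    """Build a CALL to the create_changelog_view procedure for a snapshot range."""
    cols = ", ".join(f"'{c}'" for c in identifier_columns)
    return (
        f"CALL {catalog}.system.create_changelog_view("
        f"table => '{table}', "
        f"options => map('start-snapshot-id', '{start_snapshot_id}', "
        f"'end-snapshot-id', '{end_snapshot_id}'), "
        f"identifier_columns => array({cols}))"
    )


if __name__ == "__main__":
    # With a SparkSession in hand, each string would be passed to spark.sql(...).
    print(set_merge_mode_sql("my_catalog.db.events", "merge-on-read"))
    print(changelog_view_sql("my_catalog", "db.events", 1, 2, ["id"]))
```

In our case the first statement is what we run today, and the second is the call we cannot use against the resulting MoR table.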
   
   ## What we've considered
   
   - **Post-merge changelog via `create_changelog_view`** — blocked by lack of 
MoR support.
   - **Generating CDC at write time** (inside the Spark writer) — this works 
for us today via a patched Iceberg build, but we'd prefer a supported upstream 
path.
   - **Upstream contribution** — we're open to contributing a mechanism that 
allows customizing or extending the Spark writer (e.g., an SPI/plugin point) so 
that CDC records can be captured during the merge write path without forking 
Iceberg.
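
As a rough idea of the shape such an extension point could take, here is a hypothetical sketch. None of these names (`RowChangeListener`, `RowChange`, `on_change`, `on_commit`, `simulate_merge_write`) exist in Iceberg; an actual SPI would be a Java interface in the Spark writer, and Python is used here only to keep the sketch short and runnable.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Optional, Protocol


class ChangeType(Enum):
    INSERT = "INSERT"
    UPDATE = "UPDATE"
    DELETE = "DELETE"


@dataclass(frozen=True)
class RowChange:
    change_type: ChangeType
    before: Optional[Dict[str, object]]  # None for inserts
    after: Optional[Dict[str, object]]   # None for deletes


class RowChangeListener(Protocol):
    """Hypothetical hook the merge writer would call; CDC logic lives outside."""

    def on_change(self, change: RowChange) -> None: ...

    def on_commit(self, snapshot_id: int) -> None: ...


class InMemoryListener:
    """Example implementation that buffers changes until the commit succeeds."""

    def __init__(self) -> None:
        self.pending: List[RowChange] = []
        self.committed: Dict[int, List[RowChange]] = {}

    def on_change(self, change: RowChange) -> None:
        self.pending.append(change)

    def on_commit(self, snapshot_id: int) -> None:
        self.committed[snapshot_id] = self.pending
        self.pending = []


def simulate_merge_write(
    listener: RowChangeListener, changes: Iterable[RowChange], snapshot_id: int
) -> None:
    """Stand-in for the writer: report each row change, then the commit."""
    for change in changes:
        listener.on_change(change)
    listener.on_commit(snapshot_id)
```

The point of the split between `on_change` and `on_commit` is that change records are only published once the snapshot is committed, so a failed merge job does not leak partial CDC output.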
   
   ## Questions
   
   1. Is there a recommended approach for producing CDC output from MERGE 
operations on MoR tables that we're missing?
   2. We're aware there's an open PR to support `create_changelog_view` with 
Merge-on-Read — is there a timeline or known blockers for that work?
   3. Would the community be open to a contribution that adds a writer-level 
extension point (SPI) for capturing row-level changes during merge? We'd keep 
the CDC logic external — the contribution would just be the hook/interface in 
the writer.
   
   Any guidance on the preferred direction would be really appreciated. Happy 
to provide more details on our workload characteristics if helpful.
   
   Thanks!



