paulpaul1076 opened a new issue, #8809: URL: https://github.com/apache/iceberg/issues/8809
### Feature Request / Improvement

Delta Lake has an interesting feature, which you can read about here: https://docs.delta.io/latest/delta-streaming.html#idempotent-table-writes-in-foreachbatch

From what I understand, Iceberg does not support this, but I think it is a really important feature. Can we add it to Iceberg?

I don't think multi-table transactions will solve this problem, because from my understanding `foreachBatch` commits its offsets only after the entire lambda function passed to it has been executed. Now imagine you have this code with multi-table transactions:

```scala
dfStr.writeStream.foreachBatch((df: DataFrame, id: Long) => {
  // create transaction1
  // create transaction2
  // multi_table_commit(transaction1, transaction2)
  // send something to kafka
}).start().awaitTermination()
```

From what I understand, if the "send something to kafka" step fails, the entire microbatch is re-executed and the multi-table transaction will write the same data a second time, which will cause data duplication. At my job, for example, we use this kind of logic, and we frequently kill our streaming jobs to redeploy new code, after which we restart them.

So, from my understanding, Iceberg is not an idempotent sink and you can't expect end-to-end exactly-once semantics with Iceberg?

### Query engine

Spark
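For context, the Delta feature linked above works by recording a `(txnAppId, txnVersion)` pair with each commit and skipping batches whose version was already committed, which makes replays of a microbatch a no-op. A minimal sketch of that skip logic, outside of Spark and Iceberg (the `IdempotentSink` name and in-memory storage are purely illustrative; in Iceberg the last committed batch id would presumably need to live in table metadata, e.g. a snapshot summary property):

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch only, not an Iceberg API: a sink that remembers the
// last committed batch id and ignores replays of already-committed batches.
final class IdempotentSink {
  private val rows = ArrayBuffer[String]()
  private var lastCommittedBatchId: Option[Long] = None

  // Returns true if the batch was written, false if it was a replay and skipped.
  def commit(batchId: Long, batch: Seq[String]): Boolean =
    if (lastCommittedBatchId.exists(batchId <= _)) {
      false // batch id not newer than the last commit: a replay, skip it
    } else {
      rows ++= batch
      lastCommittedBatchId = Some(batchId)
      true
    }

  def rowCount: Int = rows.size
}
```

With this kind of check, re-running a microbatch after a failed post-commit step (like the "send something to kafka" above) would not duplicate data, because the second `commit` call with the same batch id is skipped.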