[I] Make iceberg an idempotent sink for Spark like delta lake [iceberg]

via GitHub Thu, 12 Oct 2023 05:37:12 -0700


paulpaul1076 opened a new issue, #8809:
URL: https://github.com/apache/iceberg/issues/8809


   ### Feature Request / Improvement
   
   Delta Lake supports idempotent table writes inside foreachBatch, which you can read about here: 
https://docs.delta.io/latest/delta-streaming.html#idempotent-table-writes-in-foreachbatch
   And here:
   
![image](https://github.com/apache/iceberg/assets/4533296/f3817344-0337-4b6c-873f-4bb51f0da78a)
   
![image](https://github.com/apache/iceberg/assets/4533296/1c688140-7694-46a1-a31f-e2bf57658d97)
   
   From what I understand, Iceberg does not support this, but I think it is a 
really important feature. Can we add it to Iceberg?
   
   I don't think multi-table transactions will solve this problem. As I 
understand it, foreachBatch commits its offsets only after the entire lambda 
passed to it has finished executing. Imagine you have this code with 
multi-table transactions:
   
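   ```
     dfStr.writeStream.foreachBatch((df: DataFrame, id: Long) => {
       // The steps below are pseudocode for a multi-table transaction.
       // create transaction1
       // create transaction2
       // multi_table_commit(transaction1, transaction2)
       // send something to kafka
       // Offsets for this micro-batch are committed only after this lambda returns.
     }).start().awaitTermination()
   ```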
     
   From what I understand, if the "send something to kafka" step fails, the 
entire micro-batch is re-executed and the multi-table transaction writes the 
same data a second time, which causes data duplication. At my job, for 
example, we use this kind of logic, and we frequently kill our streaming jobs 
to deploy new code and then restart them.
    
   So, from my understanding, Iceberg is not an idempotent sink, and you can't 
expect end-to-end exactly-once semantics with Iceberg?
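   
   To make the failure mode concrete, here is a minimal sketch in plain Scala 
(no Spark) of a micro-batch that is replayed after a post-write failure. The 
`Table` class, `runBatch`, and `simulate` are hypothetical stand-ins, not 
Iceberg or Delta APIs. The idempotent variant mimics the idea behind Delta's 
`txnAppId` and `txnVersion` write options: the sink remembers the last batch id 
committed per application and skips a write it has already seen.
   
   ```scala
   import scala.collection.mutable

   object IdempotentSinkSketch {
     // Hypothetical in-memory stand-in for a table; not an Iceberg or Delta API.
     final class Table {
       val rows = mutable.ArrayBuffer.empty[String]
       // Last committed batch id per writer application, kept alongside the data.
       private val lastVersion = mutable.Map.empty[String, Long]

       // Plain append: a replayed batch is written again.
       def append(batch: Seq[String]): Unit = rows ++= batch

       // Idempotent append: skip the write if this (appId, batchId) was already committed.
       def appendIdempotent(appId: String, batchId: Long, batch: Seq[String]): Boolean =
         if (lastVersion.get(appId).exists(_ >= batchId)) false
         else {
           rows ++= batch
           lastVersion(appId) = batchId
           true
         }
     }

     // Runs one micro-batch: write to the table, then a side effect that may fail.
     // Returns false when the side effect failed, meaning the batch will be replayed.
     def runBatch(write: () => Unit, sideEffectFails: Boolean): Boolean = {
       write()
       !sideEffectFails
     }

     // First attempt fails after the table write; the replay then succeeds.
     // Returns the number of rows that end up in the table.
     def simulate(idempotent: Boolean): Int = {
       val table = new Table
       val batch = Seq("a", "b")
       val write: () => Unit =
         if (idempotent) () => { table.appendIdempotent("my-app", 0L, batch); () }
         else () => table.append(batch)
       val firstOk = runBatch(write, sideEffectFails = true)
       if (!firstOk) runBatch(write, sideEffectFails = false)
       table.rows.size
     }

     def main(args: Array[String]): Unit = {
       println(s"plain append rows: ${simulate(idempotent = false)}")
       println(s"idempotent append rows: ${simulate(idempotent = true)}")
     }
   }
   ```
   
   With a plain append the two-row batch lands twice (four rows); with the 
(appId, batchId) check the replay is a no-op and the table keeps two rows. The 
sketch only works because the batch id is recorded together with the data, 
which is the part that would need support from the table format itself.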
   
   ### Query engine
   
   Spark


