Re: [I] Parquet file overwritten by spark streaming job in subsequent execution with same spark streaming checkpoint location [iceberg]

via GitHub Fri, 08 Dec 2023 20:14:49 -0800


amogh-jahagirdar commented on issue #9172:
URL: https://github.com/apache/iceberg/issues/9172#issuecomment-1848220358


   Thanks for the details. One key thing stands out to me:
   
   ```
   I also tested with latest version, iceberg-spark-runtime-3.4_2.12-1.4.2.jar 
as well, I could see that the second number, part of the file name, is 
continuously increasing 
00001-3200-11773075-523f-4667-936b-88702fe9860c-00001.parquet, however after 
around 200 execution of stream, the file name got reset 
00001-3166-11773075-523f-4667-936b-88702fe9860c-00001.parquet and files were 
started getting overwritten.
   ```
   
   This does align with the suspicion in the other issue that task IDs can be 
reused across epochs. I'm reading "after around 200 executions of stream" as 
200 microbatch intervals.
   
   That reuse makes sense to me (and anyway it's probably intentional in the 
DSV2 API to surface the writer). I'll put up a draft that adds the epoch ID to 
the output path.
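
To make the collision concrete, here is a minimal Python sketch. It is not Iceberg code: `data_file_name` is a hypothetical helper, the field layout (partition ID, task ID, operation UUID, file count) is only my reading of the file names quoted above, and both the position of the epoch ID in the name and the epoch numbers used are assumptions for illustration.

```python
from typing import Optional


def data_file_name(partition_id: int, task_id: int, operation_id: str,
                   file_count: int, epoch_id: Optional[int] = None) -> str:
    """Build a data file name shaped like the ones quoted in the issue.

    Hypothetical helper: the layout is inferred from the quoted names,
    and where the epoch ID would go is an assumption, not the actual fix.
    """
    parts = [f"{partition_id:05d}", str(task_id)]
    if epoch_id is not None:
        parts.append(str(epoch_id))
    parts += [operation_id, f"{file_count:05d}"]
    return "-".join(parts) + ".parquet"


# Operation UUID taken from the file names quoted in the issue.
OPERATION_ID = "11773075-523f-4667-936b-88702fe9860c"

# Task ID 3166 shows up again in a later epoch. Without the epoch in the
# name, both writes resolve to the same path, so the second one overwrites
# the first.
early_write = data_file_name(1, 3166, OPERATION_ID, 1)
late_write = data_file_name(1, 3166, OPERATION_ID, 1)
assert early_write == late_write

# With a (made-up) epoch ID in the name, the reused task ID no longer collides.
early_with_epoch = data_file_name(1, 3166, OPERATION_ID, 1, epoch_id=10)
late_with_epoch = data_file_name(1, 3166, OPERATION_ID, 1, epoch_id=210)
assert early_with_epoch != late_with_epoch
```

The point of the sketch is only that a name built from partition ID, task ID, a per-query UUID, and a per-writer counter stays unique only while task IDs are never reused. Any component that changes every microbatch, such as the epoch ID, restores uniqueness.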


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

