sanchay0 opened a new issue, #11894:
URL: https://github.com/apache/iceberg/issues/11894

   ### Query engine
   
   Flink
   
   ### Question
   
   I am running a Flink job that reads data from Kafka, processes it into a Flink [Row object](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/types/Row.html), and writes it to an [Iceberg Sink](https://iceberg.apache.org/javadoc/1.3.0/org/apache/iceberg/flink/sink/FlinkSink.Builder.html#append--). To deploy new code changes, I restart the job from the latest savepoint, which records the offsets committed at the Kafka source. Because the savepoint stores the most recent Kafka offset read by the job, the next run continues from that offset, giving transparent failure recovery. (A minimal sketch of this setup follows the scenario below.) Recently, I encountered a scenario where data was lost from the final output after a job restart. Here's what happened:
   
   - The job reads from Kafka starting at offset 1 and writes the data to an Iceberg table backed by S3.
   - It then reads from offset 2, but the writes to the Iceberg table are delayed by backpressure or network issues.
   - The job proceeds to read from offset 3.
   - Before the data from offset 2 is written to S3, I restart the job. After the restart, the job resumes from offset 3 (the latest offset recorded in the savepoint), so the data from offset 2, which was never written to S3, is lost.
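   
   For reference, here is a minimal sketch of the setup described above. The broker address, topic, consumer group, table location, checkpoint interval, and single-column schema are all placeholders for the real job's configuration, and the `map` step stands in for the actual processing:
   
   ```java
   import org.apache.flink.api.common.eventtime.WatermarkStrategy;
   import org.apache.flink.api.common.serialization.SimpleStringSchema;
   import org.apache.flink.api.common.typeinfo.Types;
   import org.apache.flink.connector.kafka.source.KafkaSource;
   import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
   import org.apache.flink.streaming.api.datastream.DataStream;
   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
   import org.apache.flink.table.api.DataTypes;
   import org.apache.flink.table.api.TableSchema;
   import org.apache.flink.types.Row;
   import org.apache.iceberg.flink.TableLoader;
   import org.apache.iceberg.flink.sink.FlinkSink;
   import org.apache.kafka.clients.consumer.OffsetResetStrategy;
   
   public class KafkaToIcebergJob {
     public static void main(String[] args) throws Exception {
       StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
       // The Iceberg sink stages data files between checkpoints and commits them to the
       // table only when a checkpoint (or savepoint) completes; the Kafka offsets are
       // stored in that same checkpoint.
       env.enableCheckpointing(60_000L); // placeholder interval
   
       KafkaSource<String> source = KafkaSource.<String>builder()
           .setBootstrapServers("kafka:9092") // placeholder broker
           .setTopics("events")               // placeholder topic
           .setGroupId("iceberg-writer")      // placeholder group id
           .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
           .setValueOnlyDeserializer(new SimpleStringSchema())
           .build();
   
       DataStream<Row> rows = env
           .fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           // Stand-in for the real processing: wrap each record in a one-field Row.
           .map(Row::of)
           .returns(Types.ROW_NAMED(new String[] {"payload"}, Types.STRING));
   
       // Flink schema matching the (placeholder) single-column Iceberg table.
       TableSchema schema = TableSchema.builder()
           .field("payload", DataTypes.STRING())
           .build();
   
       FlinkSink.forRow(rows, schema)
           .tableLoader(TableLoader.fromHadoopTable("s3://bucket/warehouse/db/events")) // placeholder
           .append();
   
       env.execute("kafka-to-iceberg");
     }
   }
   ```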
   
   Is there a workaround for this problem, or is it an inherent limitation of the Iceberg Sink?

