sanchay0 opened a new issue, #11894: URL: https://github.com/apache/iceberg/issues/11894
### Query engine

Flink

### Question

I am running a Flink job that reads data from Kafka, processes it into a Flink [Row object](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/types/Row.html), and writes it to an [Iceberg sink](https://iceberg.apache.org/javadoc/1.3.0/org/apache/iceberg/flink/sink/FlinkSink.Builder.html#append--). To deploy new code changes, I restart the job from the latest savepoint. The savepoint stores the most recent Kafka offsets committed at the source, so the next run resumes from where the previous one left off, giving transparent failure recovery.

Recently, I hit a scenario where data was lost from the final output after a job restart. Here is what happened (a sketch of the job wiring follows the list):

- The job reads from Kafka starting at offset 1 and writes the data to an Iceberg table backed by S3.
- It then reads offset 2, but the writes to the Iceberg table are delayed due to backpressure or network issues.
- The job proceeds to read offset 3.
- Before the data from offset 2 is written to S3, I restart the job.

After the restart, the job resumes from offset 3, so the data from offset 2, which was never written to S3, is lost.

Is there a workaround for this problem, or is it an inherent limitation of the Iceberg sink?
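For reference, here is a simplified sketch of how the job is wired together. The bootstrap servers, topic, group ID, table path, and single-column schema are placeholders, not the real values. With checkpointing enabled, the Iceberg sink only commits data files when a checkpoint completes, and the Kafka source stores its offsets in that same checkpoint state, which is why I expected offsets and table commits to stay in sync across a restart.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.types.Row;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class KafkaToIcebergJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Iceberg commits are driven by checkpoints: the sink publishes a new
        // table snapshot only when a checkpoint completes. Kafka offsets are
        // snapshotted in the same checkpoint.
        env.enableCheckpointing(60_000L);

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")          // placeholder address
                .setTopics("events")                        // placeholder topic
                .setGroupId("iceberg-writer")               // placeholder group ID
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Convert each Kafka record into a Row (real parsing is more involved).
        DataStream<Row> rows = env
                .fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .map(Row::of)
                .returns(Types.ROW(Types.STRING));

        // Placeholder table location; the real job loads the table from a catalog.
        TableLoader tableLoader = TableLoader.fromHadoopTable("s3://my-bucket/warehouse/db/table");

        FlinkSink.forRow(rows, TableSchema.builder().field("value", DataTypes.STRING()).build())
                .tableLoader(tableLoader)
                .append();

        env.execute("kafka-to-iceberg");
    }
}
```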