bobby-richard commented on code in PR #10045:
URL: https://github.com/apache/pinot/pull/10045#discussion_r1062499804

##########
pinot-connectors/pinot-flink-connector/src/main/java/org/apache/pinot/connector/flink/sink/PinotSinkFunction.java:
##########
@@ -151,4 +148,18 @@ private void flush()
       LOG.info("Pinot segment uploaded to {}", segmentURI);
     });
   }
+
+  @Override
+  public List<GenericRow> snapshotState(long checkpointId, long timestamp) {

Review Comment:
   @snleee Let's say we are using a Kafka source to feed the Pinot sink, with checkpointing enabled at a one-minute interval. Every minute, when the checkpoint is taken, the Kafka source persists its current offsets in the checkpoint. If the job fails, it is restarted from the last checkpoint: the Kafka source resumes from the last committed offsets, but the Pinot sink will have lost all pending rows of the current unflushed segment up to that point.

   You can rewind data from the source when running in batch mode without checkpointing enabled. But in a streaming job with a Kafka source, it's impossible to match up the source's Kafka offsets with the last row of the most recently flushed segment. I'm not sure how the sink can support checkpointing without storing the pending rows in state.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
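A minimal sketch of the buffering approach the comment argues for, in plain Java with no Flink dependency. `Row` and `BufferingSink` are hypothetical stand-ins (not Pinot's `GenericRow` or the real `PinotSinkFunction`): the sink keeps unflushed rows in a list, `snapshotState` hands a copy of that list to the checkpoint, and `restoreState` re-buffers it on recovery so the rows between the last flushed segment and the checkpoint are not lost.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointedSinkSketch {

    // Hypothetical stand-in for Pinot's GenericRow; not the real class.
    static class Row {
        final String value;
        Row(String value) { this.value = value; }
    }

    // Sketch of a sink that snapshots its pending rows, mirroring the
    // List-returning snapshotState contract shown in the diff above.
    static class BufferingSink {
        private final List<Row> pending = new ArrayList<>();

        void invoke(Row row) {
            pending.add(row);
            // A real sink would flush to a Pinot segment once a size or
            // time threshold is hit, clearing `pending` afterwards.
        }

        // Called at each checkpoint: return the rows that have not yet
        // made it into a flushed segment.
        List<Row> snapshotState(long checkpointId, long timestamp) {
            return new ArrayList<>(pending);
        }

        // Called on recovery: re-buffer the checkpointed rows so they are
        // included in the next flushed segment instead of being dropped.
        void restoreState(List<Row> state) {
            pending.clear();
            pending.addAll(state);
        }
    }

    public static void main(String[] args) {
        BufferingSink sink = new BufferingSink();
        sink.invoke(new Row("a"));
        sink.invoke(new Row("b"));

        // Checkpoint taken; the job then fails before the segment is flushed.
        List<Row> checkpoint = sink.snapshotState(1L, System.currentTimeMillis());

        // A fresh sink instance recovers the pending rows from the checkpoint.
        BufferingSink recovered = new BufferingSink();
        recovered.restoreState(checkpoint);
        System.out.println(recovered.snapshotState(2L, 0L).size()); // prints 2
    }
}
```

Without this state, a restart replays the Kafka source from its checkpointed offsets while the sink's buffer starts empty, which is exactly the row-loss scenario described in the comment.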
########## pinot-connectors/pinot-flink-connector/src/main/java/org/apache/pinot/connector/flink/sink/PinotSinkFunction.java: ########## @@ -151,4 +148,18 @@ private void flush() LOG.info("Pinot segment uploaded to {}", segmentURI); }); } + + @Override + public List<GenericRow> snapshotState(long checkpointId, long timestamp) { Review Comment: @snleee Let's say we are using a kafka source to feed the pinot sink with checkpointing enabled at a 1 minute interval. Every minute when the checkpoint is taken, the kafka source will persist its current offsets in the checkpoint. If the job fails, it will be restarted from the last checkpoint. The kafka source will resume from the last committed offsets, but the pinot sink will have lost all pending rows for the current unflushed segment up until that point. You can rewind data from the source if running in batch mode without checkpointing enabled. But if you are running a streaming job with a kafka source, it's impossible to match up the kafka offsets of the source with the last row of the most recently flushed segment. I'm not sure how you can support checkpointing in the sink without storing the pending rows in state. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org