bobby-richard commented on code in PR #10045:
URL: https://github.com/apache/pinot/pull/10045#discussion_r1062499804


##########
pinot-connectors/pinot-flink-connector/src/main/java/org/apache/pinot/connector/flink/sink/PinotSinkFunction.java:
##########
@@ -151,4 +148,18 @@ private void flush()
       LOG.info("Pinot segment uploaded to {}", segmentURI);
     });
   }
+
+  @Override
+  public List<GenericRow> snapshotState(long checkpointId, long timestamp) {

Review Comment:
   @snleee Let's say we are using a Kafka source to feed the Pinot sink, with 
checkpointing enabled at a one-minute interval. Every minute when a checkpoint 
is taken, the Kafka source persists its current offsets in the checkpoint. 
If the job fails, it is restarted from the last checkpoint: the Kafka 
source resumes from the last committed offsets, but the Pinot sink will 
have lost all rows buffered for the current unflushed segment up to that 
point. 
   
   You can rewind data from the source if you are running in batch mode without 
checkpointing enabled. But in a streaming job with a Kafka source, there is 
no way to match the source's committed Kafka offsets to the last row of the 
most recently flushed segment.
   
   I'm not sure how you can support checkpointing in the sink without storing 
the pending rows in state.
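   To make the failure scenario above concrete, here is a minimal, framework-free 
sketch of the idea: a sink that buffers rows for an unflushed segment must hand 
those pending rows to the checkpoint, or they are gone after failover. The class 
and method names (`BufferingSink`, `snapshotState`, `restoreState`) are 
hypothetical and only mirror the shape of Flink's checkpoint hooks; this is not 
the actual `PinotSinkFunction` implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferingSink {
  // Rows received since the last segment flush; lost on failure unless checkpointed.
  private final List<String> pendingRows = new ArrayList<>();

  public void invoke(String row) {
    pendingRows.add(row);
  }

  /** Called at each checkpoint: return a copy of the pending rows to persist. */
  public List<String> snapshotState() {
    return new ArrayList<>(pendingRows);
  }

  /** Called on restore: re-populate the buffer from the checkpointed state. */
  public void restoreState(List<String> checkpointed) {
    pendingRows.clear();
    pendingRows.addAll(checkpointed);
  }

  public int pendingCount() {
    return pendingRows.size();
  }

  public static void main(String[] args) {
    BufferingSink sink = new BufferingSink();
    sink.invoke("row-1");
    sink.invoke("row-2");

    // Checkpoint taken: the Kafka source persists its offsets,
    // the sink persists its pending rows.
    List<String> checkpoint = sink.snapshotState();

    sink.invoke("row-3"); // arrives after the checkpoint

    // Job fails and restarts: a fresh sink instance restores from the checkpoint.
    BufferingSink restored = new BufferingSink();
    restored.restoreState(checkpoint);

    // row-1 and row-2 survive in the sink; row-3 is replayed by the source
    // from its restored offsets, so the segment loses nothing.
    System.out.println("pending after restore = " + restored.pendingCount());
  }
}
```

Without `snapshotState` returning the buffer, the restored sink would start 
empty while the source only replays rows from the checkpointed offsets onward, 
which is exactly the data-loss window described above.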



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
