mudit-97 commented on PR #9694: URL: https://github.com/apache/iceberg/pull/9694#issuecomment-1969366948
Sure @pvary, we wanted a single-phase commit solution for the following reasons:

1. We are writing a PubSub source operator that acks messages in notifyCheckpointComplete.
2. With 2PC, notifyCheckpointComplete is called in parallel, so there is no guarantee that the messages acked in PubSub have actually been written to Iceberg; they might still be sitting in the checkpoint directory.
3. If the job goes down at any time, we always have to manage the checkpoints and resume the job from a checkpoint. If the checkpoints are corrupted, we have to seek the PubSub operator back.
4. Apart from all of this, PubSub metrics (or any source operator's metrics) never give a consistent view, because acked messages can still be sitting in the checkpoint directory instead of in the sink.

We understand that messages can be duplicated in this case, but for some use cases we believe duplication is acceptable compared with managing checkpoints, handling their corruption, and keeping metrics consistent along the way, especially metrics like watermarks.

That's why we wanted to keep this behavior behind a flag, so that consumers can choose to enable it if needed.
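
To make point 2 concrete, here is a minimal sketch of the ordering concern. It is a plain-Python toy model, not Flink, Iceberg, or PubSub code: the `Simulation` class and all of its method names are hypothetical, and the single-phase variant only assumes that the table commit happens as part of the checkpoint itself, which may not match how the PR implements it.

```python
from dataclasses import dataclass, field


@dataclass
class Simulation:
    """Toy model of one checkpoint cycle (hypothetical, for illustration only)."""

    checkpointed: set = field(default_factory=set)  # staged in checkpoint state
    acked: set = field(default_factory=set)         # acked at the source
    committed: set = field(default_factory=set)     # visible in the table

    def checkpoint(self, messages):
        # Two-phase, step 1: written data is only recorded in checkpoint state.
        self.checkpointed |= set(messages)

    def source_on_checkpoint_complete(self):
        # The source acks everything covered by the completed checkpoint.
        self.acked |= self.checkpointed

    def sink_on_checkpoint_complete(self, commit_succeeds=True):
        # Two-phase, step 2: the sink commits what was staged, if it gets the chance.
        if commit_succeeds:
            self.committed |= self.checkpointed

    def single_phase_checkpoint(self, messages, commit_succeeds=True):
        # Assumed single-phase variant: the commit is part of the checkpoint,
        # so a failed commit fails the checkpoint and nothing gets acked.
        if not commit_succeeds:
            return False
        self.committed |= set(messages)
        self.checkpointed |= set(messages)
        return True

    def acked_but_not_committed(self):
        return self.acked - self.committed


# Two-phase: the source acks, then the job dies before the sink commits.
two_phase = Simulation()
two_phase.checkpoint(["m1", "m2"])
two_phase.source_on_checkpoint_complete()
two_phase.sink_on_checkpoint_complete(commit_succeeds=False)
print(sorted(two_phase.acked_but_not_committed()))  # ['m1', 'm2']

# Single-phase: acks happen only after a checkpoint that already committed.
single_phase = Simulation()
if single_phase.single_phase_checkpoint(["m1", "m2"]):
    single_phase.source_on_checkpoint_complete()
print(sorted(single_phase.acked_but_not_committed()))  # []
```

In the two-phase run the acked messages exist only in checkpoint state until the job is restored from that checkpoint, which is the dependency described in points 2 and 3. The sketch does not model the duplication trade-off: in the single-phase variant, a commit that succeeds just before the job fails would be replayed from the source after restart.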