hililiwei commented on PR #6253:
URL: https://github.com/apache/iceberg/pull/6253#issuecomment-1330024532
Hi @stevenzwu
> What do you mean round robin?
I mean, downstream applications need to get the watermark of the table at
intervals to determine whether to start processing. For example, get the latest
watermark every five minutes.
> How could downstream consumers know when a new partition show up?
The writer determines whether to commit new partitions to the table based on
a combination of triggers and policies. Once a new partition is committed, it
means that the writer considers the new partition's data ready. For details:
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/table/filesystem/#partition-commit.
Downstream applications read tables using incremental read scan. When a new
snapshot (including a new partition and data) is commited by the upstream, the
downstream can get it.
> can you elaborate the differences btw watermark and new complete partition
for downstream consumers?
Take a scenario in our current production as an example. When my table is
partitioned by hour, if the data in [12:00,13:00) is not completely written, I
do not want consumers to see it when querying the table. The query action may
be an ad hoc query or a statistics task. I want to commit the snapshot after
all the data in [12:00, 13:00) is written instead of commting the snapshot at
each checkpoint.
Thx
Liwei
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]