shanzi opened a new issue, #15867:
URL: https://github.com/apache/iceberg/issues/15867

   ### Apache Iceberg version
   
   1.10.1 (latest release)
   
   ### Query engine
   
   Flink
   
   ### Please describe the bug 🐞
   
   In the [watermark 
emitter](https://github.com/apache/iceberg/blob/3e22f850bd21f3f8f4ecb340a5d4ffe468fb1ced/flink/v2.1/flink/src/main/java/org/apache/iceberg/flink/source/reader/WatermarkExtractorRecordEmitter.java#L57),
 it seems we are emitting min timestamp extracted in the 
[ColumnStatsWatermarkExtractor](https://github.com/apache/iceberg/blob/3e22f850bd21f3f8f4ecb340a5d4ffe468fb1ced/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/source/reader/ColumnStatsWatermarkExtractor.java#L39).
   
   However, in Flink watermark means events less than or *equal to* the value 
has already been processed. Thus, if there are many records just have the 
minimum timestamp in the split and there is a subsequent stateful operator that 
will drop late records. Those records will be treated as late.
   
   Thus, we should emit min metrics minus one for every split.
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to