davidrobbo commented on issue #12444: URL: https://github.com/apache/iceberg/issues/12444#issuecomment-2838964017

@singhpk234 perhaps you have some understanding of a related issue. From digging into the source code, it also appears that a readStream from an Iceberg table (at least with my setup as above) cannot progress past Integer.MAX_VALUE in total across more than one micro-batch. The first snapshot ID observed by the streaming query is written to `sources/0/offsets/0`, and then during stream initialisation the following check prevents the file count or row count from exceeding the configured limit:

https://github.com/apache/iceberg/blob/e1ab42591488f25471a850547da9952aa9de48c2/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L375

It is only later, when planning the next micro-batch, that the offsets (i.e. the last processed snapshot ID) are used to identify unprocessed files/rows for incremental processing:

https://github.com/apache/iceberg/blob/e1ab42591488f25471a850547da9952aa9de48c2/core/src/main/java/org/apache/iceberg/IncrementalDataTableScan.java#L64

Do you happen to have any insight on this?
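
For what it's worth, here is a minimal sketch of the behaviour I believe I'm seeing. It is plain Java with illustrative stand-ins (`planNextBatch`, `rowCountsSinceStreamStart`, `maxRowsPerMicroBatch` are all hypothetical names, not the actual SparkMicroBatchStream code): if the running row total is accumulated from the snapshot recorded at stream start rather than from the last committed offset, the stream stalls once the total since stream start exceeds the limit, whose default is Integer.MAX_VALUE rows.

```java
import java.util.ArrayList;
import java.util.List;

public class RateLimitSketch {

  /**
   * Illustrative only: admits files for the next micro-batch by accumulating row counts
   * starting from the snapshot written to sources/0/offsets/0 (the stream's first offset),
   * rather than from the last committed offset. Once the running total since stream start
   * exceeds the limit, nothing new is ever admitted, which matches the stall described above.
   */
  static List<Long> planNextBatch(List<Long> rowCountsSinceStreamStart, long maxRowsPerMicroBatch) {
    List<Long> admitted = new ArrayList<>();
    long cumulativeRows = 0;
    for (long rows : rowCountsSinceStreamStart) {
      if (cumulativeRows + rows > maxRowsPerMicroBatch) {
        break; // cap is hit against the whole history, not just this micro-batch
      }
      cumulativeRows += rows;
      admitted.add(rows);
    }
    return admitted;
  }

  public static void main(String[] args) {
    // Pretend each appended data file carries ~1B rows; after two files the running total
    // crosses Integer.MAX_VALUE and every subsequent batch plans zero files.
    List<Long> files = List.of(1_000_000_000L, 1_000_000_000L, 1_000_000_000L, 1_000_000_000L);
    System.out.println(planNextBatch(files, Integer.MAX_VALUE)); // [1000000000, 1000000000]
  }
}
```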
@singhpk234 perhaps you have some understanding of a related issue. From digging into the source code, it also appears that readStream from a Iceberg table (at least with my setup as above) can not progress above processing Integer.MAX_VALUE in total across > 1 micro batches. It appears the first snapshot ID observed by the streaming query is written to `sources/0/offsets/0` (i.e.), and then during stream initialisation, the following check prevents file count or row count exceeding that which is configured: https://github.com/apache/iceberg/blob/e1ab42591488f25471a850547da9952aa9de48c2/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L375 And it's only later during planning the next micro batch where offsets (i.e. the last processed snapshot ID) is used to identify non-processed files/rows in incremental processing. https://github.com/apache/iceberg/blob/e1ab42591488f25471a850547da9952aa9de48c2/core/src/main/java/org/apache/iceberg/IncrementalDataTableScan.java#L64) Do you happen to have any insight on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org