davidrobbo commented on issue #12444: URL: https://github.com/apache/iceberg/issues/12444#issuecomment-2838964017

@singhpk234 perhaps you have some understanding of a related issue. From digging into the source code, it also appears that a readStream from an Iceberg table (at least with my setup as above) cannot progress past Integer.MAX_VALUE in total across more than one micro-batch. The first snapshot ID observed by the streaming query is written to `sources/0/offsets/0`, and then during stream initialisation the following check prevents the file count or row count from exceeding the configured limit:

https://github.com/apache/iceberg/blob/e1ab42591488f25471a850547da9952aa9de48c2/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L375

It is only later, when planning the next micro-batch, that the offsets (i.e. the last processed snapshot ID) are used to identify unprocessed files/rows for incremental processing:

https://github.com/apache/iceberg/blob/e1ab42591488f25471a850547da9952aa9de48c2/core/src/main/java/org/apache/iceberg/IncrementalDataTableScan.java#L64

Do you happen to have any insight on this?
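
For what it's worth, here is a minimal sketch of the behaviour I believe I'm seeing. It is plain Java with illustrative stand-ins (`planNextBatch`, `rowCountsSinceStreamStart`, `maxRowsPerMicroBatch` are all hypothetical names, not the actual SparkMicroBatchStream code): if the running row total is accumulated from the snapshot recorded at stream start rather than from the last committed offset, the stream stalls once the total since stream start exceeds the limit, whose default is Integer.MAX_VALUE rows.

```java
import java.util.ArrayList;
import java.util.List;

public class RateLimitSketch {

  /**
   * Illustrative only: admits files for the next micro-batch by accumulating row counts
   * starting from the snapshot written to sources/0/offsets/0 (the stream's first offset),
   * rather than from the last committed offset. Once the running total since stream start
   * exceeds the limit, nothing new is ever admitted, which matches the stall described above.
   */
  static List<Long> planNextBatch(List<Long> rowCountsSinceStreamStart, long maxRowsPerMicroBatch) {
    List<Long> admitted = new ArrayList<>();
    long cumulativeRows = 0;
    for (long rows : rowCountsSinceStreamStart) {
      if (cumulativeRows + rows > maxRowsPerMicroBatch) {
        break; // cap is hit against the whole history, not just this micro-batch
      }
      cumulativeRows += rows;
      admitted.add(rows);
    }
    return admitted;
  }

  public static void main(String[] args) {
    // Pretend each appended data file carries ~1B rows; after two files the running total
    // crosses Integer.MAX_VALUE and every subsequent batch plans zero files.
    List<Long> files = List.of(1_000_000_000L, 1_000_000_000L, 1_000_000_000L, 1_000_000_000L);
    System.out.println(planNextBatch(files, Integer.MAX_VALUE)); // [1000000000, 1000000000]
  }
}
```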
@singhpk234 perhaps you have some understanding of a related issue. From digging into the source code, it also appears that readStream from a Iceberg table (at least with my setup as above) can not progress above processing Integer.MAX_VALUE in total across > 1 micro batches. It appears the first snapshot ID observed by the streaming query is written to `sources/0/offsets/0` (i.e.), and then during stream initialisation, the following check prevents file count or row count exceeding that which is configured: https://github.com/apache/iceberg/blob/e1ab42591488f25471a850547da9952aa9de48c2/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L375 And it's only later during planning the next micro batch where offsets (i.e. the last processed snapshot ID) is used to identify non-processed files/rows in incremental processing. https://github.com/apache/iceberg/blob/e1ab42591488f25471a850547da9952aa9de48c2/core/src/main/java/org/apache/iceberg/IncrementalDataTableScan.java#L64) Do you happen to have any insight on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org