[I] Iceberg Spark streaming skips rows of data [iceberg]

via GitHub Tue, 16 Apr 2024 10:21:05 -0700


cccs-jc opened a new issue, #10156:
URL: https://github.com/apache/iceberg/issues/10156


   ### Apache Iceberg version
   
   1.5.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   When using Spark `readStream`, the `stream-from-timestamp` option positions 
the read at the specified timestamp. The query below passes the current time, 
so it reads from the head of the queue.
   ```python
   import time

   ts = int(time.time() * 1000)  # current time in epoch milliseconds
   df = (
       spark.readStream.format("iceberg")
       .option("streaming-skip-delete-snapshots", True)
       .option("streaming-skip-overwrite-snapshots", True)
       .option("streaming-max-files-per-micro-batch", max_files)
       .option("streaming-max-rows-per-micro-batch", max_rows)
       .option("stream-from-timestamp", ts)
       .load(source_table)
   )
   ```
   
   To reproduce, kill the streaming job, wait 10 minutes, then start it again. 
The readStream loads the checkpointed offset from disk and is supposed to 
resume from that offset. However, there is a bug that causes it to skip the 
commits that occurred during those 10 minutes; instead, the readStream reads 
from the latest commit.
   
   I can work around this bug by not specifying `stream-from-timestamp` when 
the query resumes from a checkpointed offset.
   ```python
       ts = int(time.time() * 1000)
       # Only pass a start timestamp on the first run, before any checkpoint exists.
       use_ts = not checkpoint_dir_exists(checkpoint_location)
       reader = (
           get_spark().readStream.format("iceberg")
           .option("streaming-skip-delete-snapshots", True)
           .option("streaming-skip-overwrite-snapshots", True)
           .option("streaming-max-files-per-micro-batch", max_files)
           .option("streaming-max-rows-per-micro-batch", max_rows)
       )
       if use_ts:
           reader = reader.option("stream-from-timestamp", ts)
       df = reader.load(source_table)
   ```
   But this is error-prone. As a user, I expect the readStream to continue 
from the last checkpointed offset.
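   
   The `checkpoint_dir_exists` helper used in the workaround is not shown 
above. A minimal sketch of what it might look like, assuming the checkpoint 
location is a path on the local filesystem (for HDFS or object storage the 
check would need the matching filesystem client instead). It treats a 
checkpoint as present only when its `offsets` subdirectory, where Structured 
Streaming records the offsets of each micro-batch, contains at least one entry.

```python
from pathlib import Path


def checkpoint_dir_exists(checkpoint_location: str) -> bool:
    """Return True if a previous run appears to have left a checkpoint behind.

    Assumption: checkpoint_location is a local filesystem path.
    """
    offsets_dir = Path(checkpoint_location) / "offsets"
    # An empty or missing offsets/ directory means no micro-batch was ever planned.
    return offsets_dir.is_dir() and any(offsets_dir.iterdir())
```

   Checking for recorded offsets rather than for the directory alone avoids 
treating an empty, pre-created checkpoint directory as a resumed query.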
   
   I suspect the issue might be here. 
https://github.com/apache/iceberg/blob/fc5b2b336c774b0b8b032f7d87a1fb21e76b3f20/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L326
   
   This seems to short-circuit the saved `startOffset` by checking whether the 
last snapshot in the table is older than the requested `stream-from-timestamp`. 
   
   @singhpk234 I have not stepped through the code to be sure. Is my theory 
possible?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
