cccs-jc opened a new issue, #8902:
URL: https://github.com/apache/iceberg/issues/8902

   ### Apache Iceberg version
   
   1.3.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   @singhpk234  I think you might know how to fix this.
   
   The implementation of `streaming-skip-overwrite-snapshots` is not what I 
expected. At this location it does skip over any rewrite snapshots, but it 
advances only one file at a time.
   
   
https://github.com/apache/iceberg/blob/b1f7008517bf9da0fe4eea6755878a87cf64341d/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L230C9-L230C17
   
   Suppose you trigger every minute and that you encounter a rewrite snapshot 
with 300 files. This means it will take 300 x 1 minute (300 minutes) to finally 
skip over the snapshot and start progressing again.
   
   I think that once a rewrite snapshot is detected we should exhaust all the 
`positions` (all the files) in that commit to position ourselves for the next 
commit.
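
To make the difference concrete, here is a minimal sketch of the two behaviours. It is a toy model of the offset advancing through a skipped snapshot, not the actual `SparkMicroBatchStream` code; `triggers_to_skip` and `exhaust_positions` are hypothetical names used only for illustration, and the one-position-per-trigger behaviour is assumed from the description above.

```python
def triggers_to_skip(file_count: int, exhaust_positions: bool) -> int:
    """Count the triggers needed to move past a snapshot that is being skipped.

    Toy model (assumption): each trigger either advances the offset by one
    file position (behaviour described in this issue) or jumps to the end of
    the snapshot (behaviour proposed in this issue).
    """
    position = 0
    triggers = 0
    while position < file_count:
        triggers += 1
        if exhaust_positions:
            # Proposed: consume every position of the skipped snapshot at once.
            position = file_count
        else:
            # Described: only one file position is consumed per trigger.
            position += 1
    return triggers


TRIGGER_INTERVAL_MINUTES = 1
FILES_IN_REWRITE_SNAPSHOT = 300

current_minutes = (
    triggers_to_skip(FILES_IN_REWRITE_SNAPSHOT, exhaust_positions=False)
    * TRIGGER_INTERVAL_MINUTES
)
proposed_minutes = (
    triggers_to_skip(FILES_IN_REWRITE_SNAPSHOT, exhaust_positions=True)
    * TRIGGER_INTERVAL_MINUTES
)

print(f"one position per trigger: {current_minutes} minutes")
print(f"exhaust all positions:    {proposed_minutes} minutes")
```

With the numbers from the example above, the first case needs one trigger per file, while the second needs a single trigger regardless of how many files the rewrite snapshot contains.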
   
   This is how the source of my `writeStream` is configured.
   
   
   ```
   import time

   # `spark`, `source_table` and `reset_checkpoint` are defined elsewhere in the job.
   # connect to source table
   df = spark.readStream.format("iceberg")
   if reset_checkpoint:
       # current time in milliseconds
       ts = int(time.time() * 1000)
       print(f"Reading {source_table} from ts {ts}")
       df = df.option("stream-from-timestamp", ts)

   df = (
       df
       .option("split-size", 16 * 1024 * 1024)
       .option("streaming-skip-delete-snapshots", True)
       .option("streaming-skip-overwrite-snapshots", True)
       .option("streaming-max-files-per-micro-batch", 200)
       .option("streaming-max-rows-per-micro-batch", 2000000)
       .load(source_table)
       # Enable the watermark so that Spark keeps track of
       # min/max/avg/watermark eventTime.
       # Note: we do not use the watermark to evict rows from an aggregation
       # window, only to keep track of eventTime metrics.
       .withWatermark("timestamp", "10 minutes")
   )
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
