cccs-jc opened a new issue, #8902: URL: https://github.com/apache/iceberg/issues/8902
### Apache Iceberg version

1.3.1

### Query engine

Spark

### Please describe the bug 🐞

@singhpk234 I think you might know how to fix this.

The implementation of `streaming-skip-overwrite-snapshots` is not what I expected. At this location it does skip over rewrite snapshots, but only a single file at a time.

https://github.com/apache/iceberg/blob/b1f7008517bf9da0fe4eea6755878a87cf64341d/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L230C9-L230C17

Suppose you trigger every minute and encounter a rewrite snapshot with 300 files. It will then take 300 x 1 minute (300 minutes) to finally skip over the snapshot and start progressing again.

I think that once a rewrite snapshot is detected we should exhaust all the `positions` (all the files) in that commit, so that the stream is positioned at the start of the next commit.

This is how my `readStream` is configured.

```
import time

# connect to source table
df = spark.readStream.format("iceberg")
if reset_checkpoint:
    # current time in milliseconds
    ts = int(time.time() * 1000)
    print(f"Reading {source_table} from ts {ts}")
    df = df.option("stream-from-timestamp", ts)
df = (
    df
    .option("split-size", 16 * 1024 * 1024)
    .option("streaming-skip-delete-snapshots", True)
    .option("streaming-skip-overwrite-snapshots", True)
    .option("streaming-max-files-per-micro-batch", 200)
    .option("streaming-max-rows-per-micro-batch", 2000000)
    .load(source_table)
    # Enable the watermark so that Spark keeps track of min/max/avg/watermark eventTime.
    # Note we do not use the watermark to evict rows from an aggregation window,
    # only to keep track of eventTime metrics.
    .withWatermark("timestamp", "10 minutes")
)
```
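To make the cost concrete, here is a small toy model of the two behaviours. It is not the Iceberg implementation: `Snapshot`, `SKIPPED_OPERATIONS`, and `triggers_to_pass` are hypothetical names made up for this sketch, and it assumes that the current code advances exactly one file position per trigger while the proposed change would move past the whole snapshot in a single trigger.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Toy stand-in for a table snapshot (hypothetical, not an Iceberg class)."""
    operation: str  # e.g. "append", "overwrite", "replace"
    num_files: int


# Operations this toy model treats as skipped by the stream (an assumption).
SKIPPED_OPERATIONS = {"overwrite", "replace", "delete"}


def triggers_to_pass(snapshot: Snapshot, skip_whole_snapshot: bool) -> int:
    """Count the triggers needed to move past one skipped snapshot."""
    if snapshot.operation not in SKIPPED_OPERATIONS:
        raise ValueError("this model only covers snapshots that the stream skips")
    if skip_whole_snapshot:
        # Proposed behaviour: exhaust every file position in one step.
        return 1
    # Reported behaviour: the offset advances one file position per trigger.
    return snapshot.num_files


rewrite = Snapshot(operation="replace", num_files=300)
trigger_interval_minutes = 1

current = triggers_to_pass(rewrite, skip_whole_snapshot=False) * trigger_interval_minutes
proposed = triggers_to_pass(rewrite, skip_whole_snapshot=True) * trigger_interval_minutes
print(f"current: {current} min, proposed: {proposed} min")
# prints "current: 300 min, proposed: 1 min"
```

Under these assumptions the stall grows linearly with the number of files in the rewrite commit, which is why a large compaction can hold up the stream for hours.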