amitmittal5 opened a new issue, #9172: URL: https://github.com/apache/iceberg/issues/9172
### Apache Iceberg version

1.4.2 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

I have a Spark streaming job that reads data from ADLS Gen2 and writes it to an Iceberg table, using the following steps:

**Step 1: Create the table**

```
CREATE TABLE IF NOT EXISTS default.blob_iceberg (id string, state string, name string)
USING ICEBERG
LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/test/blob_iceberg'
```

Run a Spark streaming Scala job with the AvailableNow trigger (the behavior is the same with the ProcessingTime("10 seconds") trigger):

```
import org.apache.spark.sql.streaming.Trigger

val sourceFilePath = "abfss://<container>@<storage-account>.dfs.core.windows.net/test/source"
// Infer the schema from a sample file in the source directory
val schema = spark.read.format("csv").option("header", "true").load(s"${sourceFilePath}/Sample.txt").schema
val checkpointPath = "abfss://<container>@<storage-account>.dfs.core.windows.net/test/blob_iceberg_checkpoint"

val sourceDF = spark
  .readStream
  .schema(schema)
  .format("csv")
  .option("header", "true")
  .option("sep", ",")
  .load(sourceFilePath)

sourceDF
  .writeStream
  .format("iceberg")
  .outputMode("append")
  .trigger(Trigger.AvailableNow)
  .option("checkpointLocation", checkpointPath)
  .toTable("default.blob_iceberg")
```

On the first execution, Parquet files are created under the data directory with names like `00000-2-852c47ed-881c-4cac-8d9f-230da7873d05-00001.parquet`, where "852c47ed-881c-4cac-8d9f-230da7873d05" is the Spark streaming id from the checkpoint metadata file.

When the same job is executed multiple times with new data in the source directory, the streaming job sometimes overwrites one or more existing Parquet files. In this test, the file `/data/00000-2-852c47ed-881c-4cac-8d9f-230da7873d05-00001.parquet` was overwritten with new data, so the original 4 records are lost and the file now contains only the 2 new records.
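To make the suspected collision concrete, here is a minimal sketch in plain Scala (no Spark required). It assumes the data file name has the shape `<partition>-<task>-<streaming id>-<counter>.parquet`. That shape is only inferred from the example name above and has not been confirmed against the Iceberg source. `FileNameCollision`, `DataFileName`, `parse`, and `collidingNames` are hypothetical names introduced for illustration. If every component is the same across two runs of the job, the names are identical and the later write would replace the earlier file.

```scala
object FileNameCollision {
  // Hypothetical model of a data file name; the field meanings are assumptions
  // inferred from the example name in this report.
  final case class DataFileName(partitionId: Int, taskId: Long, streamingId: String, counter: Int)

  // Five digits, a task number, a UUID, five digits, then the extension.
  private val Pattern =
    """(\d{5})-(\d+)-([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})-(\d{5})\.parquet""".r

  // Splits a file name into its parts, or returns None if it does not fit the assumed shape.
  def parse(name: String): Option[DataFileName] = name match {
    case Pattern(p, t, id, c) => Some(DataFileName(p.toInt, t.toLong, id, c.toInt))
    case _                    => None
  }

  // Given the file names written by each run, returns the names that occur more than once.
  def collidingNames(runs: Seq[Seq[String]]): Set[String] =
    runs.flatten
      .groupBy(identity)
      .collect { case (name, hits) if hits.size > 1 => name }
      .toSet
}
```

Feeding it per-run listings of the data directory (for example, collected from the storage account after each execution) would show which names were reused. A name that appears in two runs is a candidate for the overwrite described above.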
Here is the screenshot of the query `select substring(file_path, 88), record_count from default.blobiceberg16.files`

This also leaves the Iceberg metadata and the actual data files out of sync.

**Environment**:
Runtime: iceberg-spark-runtime-3.4_2.12-1.4.2