[GitHub] [iceberg] kongul opened a new issue, #7890: Data files name collision written by Spark Streaming job after it's restart

via GitHub Fri, 23 Jun 2023 05:04:25 -0700


kongul opened a new issue, #7890:
URL: https://github.com/apache/iceberg/issues/7890


   ### Apache Iceberg version
   
   1.2.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   We have a number of Spark jobs that stream data to Iceberg tables. Recently we ran into problems reading those tables: data files had been deleted or overwritten by other data files of a different size (we checked older object versions in the S3 bucket). After investigating, this is what we found.
   
   Here is how the file name is constructed:
https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L51-L100
   As the comment there says:
   
   ```
      * Constructor with specific operationId. The [partitionId, taskId, operationId] triplet has to be
      * unique across JVM instances otherwise the same file name could be generated by different
      * instances of the OutputFileFactory.
   ```
   
   Here we can see that `queryId` is passed as `operationId`.
   
   Now let's see what Spark passes there:
   
https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L159
   
https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L134C1-L143
   
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
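
To make the behaviour of the linked Spark code easier to follow, here is a minimal Python model of it, not the Spark implementation. It assumes the checkpoint directory holds a small JSON file named `metadata` with an `id` field. The function name `load_or_create_query_id` is hypothetical and exists only for this illustration.

```python
import json
import tempfile
import uuid
from pathlib import Path


def load_or_create_query_id(checkpoint_dir: Path) -> str:
    """Return the query id stored in the checkpoint, creating it on first start.

    Hypothetical helper: models the read-or-create behaviour described in the
    issue, with an assumed file name ("metadata") and JSON layout ({"id": ...}).
    """
    metadata_file = checkpoint_dir / "metadata"
    if metadata_file.exists():
        # Restart: reuse the id written by an earlier run.
        return json.loads(metadata_file.read_text())["id"]
    # First start: generate a random id and persist it for later runs.
    query_id = str(uuid.uuid4())
    metadata_file.write_text(json.dumps({"id": query_id}))
    return query_id


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp)
    first_start = load_or_create_query_id(checkpoint)
    after_restart = load_or_create_query_id(checkpoint)
    print(first_start == after_restart)
```

As long as the job restarts from the same checkpoint location, the id is read back rather than regenerated, so both starts see the same value.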
   
   So the `queryId` stored in the stream metadata file persists across Spark Streaming job restarts, hence your requirement `The [partitionId, taskId, operationId] triplet has to be unique` is violated. A new run of the streaming job can therefore generate a file name that already exists and overwrite the existing file.
   
   
https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L91-L100
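
The collision described above can be reproduced with a small Python sketch. The class `ToyOutputFileFactory` and its naming pattern are simplified stand-ins, not Iceberg's code. The sketch also assumes that partition ids and task ids start again from the same values after a restart, and that each new factory instance restarts its file counter.

```python
import itertools


class ToyOutputFileFactory:
    """Hypothetical stand-in for a per-task file name generator."""

    def __init__(self, partition_id, task_id, operation_id):
        self.partition_id = partition_id
        self.task_id = task_id
        self.operation_id = operation_id
        # The counter starts over in every new instance (assumption).
        self._file_count = itertools.count(1)

    def new_file_name(self):
        # Illustrative pattern: partition, task, operation id, file counter.
        return "{:05d}-{}-{}-{:05d}.parquet".format(
            self.partition_id, self.task_id, self.operation_id,
            next(self._file_count))


def run_job(operation_id, num_tasks=2, files_per_task=2):
    """Simulate one run of the job: each task writes a few files."""
    names = []
    for task_id in range(num_tasks):
        factory = ToyOutputFileFactory(task_id, task_id, operation_id)
        names.extend(factory.new_file_name() for _ in range(files_per_task))
    return names


# The same query id is reused after a restart because it comes from the checkpoint.
persisted_query_id = "11111111-2222-3333-4444-555555555555"
first_run = run_job(persisted_query_id)
second_run = run_job(persisted_query_id)

collisions = set(first_run) & set(second_run)
print(len(collisions), "file names are generated by both runs")
```

In this model every name from the second run matches a name from the first run, which is the situation where a later write can replace an existing data file.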
   


