[GitHub] [iceberg] nastra commented on a diff in pull request #6569: Spark: Add the query ID to file names

GitBox Thu, 12 Jan 2023 09:01:37 -0800


nastra commented on code in PR #6569:
URL: https://github.com/apache/iceberg/pull/6569#discussion_r1068373480



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java:
##########
@@ -335,6 +335,7 @@ public DeltaWriter<InternalRow> createWriter(int partitionId, long taskId) {
       OutputFileFactory dataFileFactory =
           OutputFileFactory.builderFor(table, partitionId, taskId)
               .format(context.dataFileFormat())
+              .operationId(context.queryId())

Review Comment:
   
   https://github.com/apache/iceberg/blob/8c6adf6e5e17603025d23b2012aa576c071ff269/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L90
   shows how the file name is determined. In the cases where the data file was 
overwritten, `partitionId`, `taskId`, and `operationId` were all the same for 
the data file and the delete file, because we now manually set the same 
`operationId` for both. Previously, the `operationId` for data and delete files 
was randomly generated.
   
   Maybe we could add a distinct suffix to the generated name to indicate 
whether it's a data file or a delete file (although I'm not sure whether this 
has any implications).
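
   To illustrate the collision and the suffix idea, here is a minimal, self-contained sketch. It is not the actual `OutputFileFactory` code: `NameFactory`, its `suffix` parameter, the per-factory counter, the `-` separator before the suffix, and the `.parquet` extension are all assumptions made for illustration. It only assumes that the name is built from `partitionId`, `taskId`, and `operationId` plus a file counter.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FileNameCollisionSketch {

  // Hypothetical stand-in for a per-writer file name factory (NOT the real OutputFileFactory).
  static final class NameFactory {
    private final int partitionId;
    private final long taskId;
    private final String operationId;
    private final String suffix; // hypothetical; null means "no suffix"
    private final AtomicInteger fileCount = new AtomicInteger(0); // assumed per-factory counter

    NameFactory(int partitionId, long taskId, String operationId, String suffix) {
      this.partitionId = partitionId;
      this.taskId = taskId;
      this.operationId = operationId;
      this.suffix = suffix;
    }

    // Builds "<partitionId>-<taskId>-<operationId>-<fileCount>[-<suffix>].parquet".
    String nextName() {
      String base =
          String.format(
              "%05d-%d-%s-%05d", partitionId, taskId, operationId, fileCount.incrementAndGet());
      return (suffix == null ? base : base + "-" + suffix) + ".parquet";
    }
  }

  public static void main(String[] args) {
    String queryId = "query-1"; // same operation ID shared by both factories

    // Without a suffix, the data and delete factories produce the same first name.
    NameFactory data = new NameFactory(0, 7L, queryId, null);
    NameFactory deletes = new NameFactory(0, 7L, queryId, null);
    String dataName = data.nextName();
    String deleteName = deletes.nextName();
    System.out.println(dataName + " vs " + deleteName + " collide: " + dataName.equals(deleteName));

    // With a suffix on the delete factory, the names differ.
    NameFactory suffixedDeletes = new NameFactory(0, 7L, queryId, "deletes");
    String suffixedName = suffixedDeletes.nextName();
    System.out.println(dataName + " vs " + suffixedName + " collide: " + dataName.equals(suffixedName));
  }
}
```

   In this sketch, the two factories share every input to the name, so the second file written silently replaces the first unless something such as a suffix distinguishes them.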



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

