szehon-ho commented on code in PR #7029:
URL: https://github.com/apache/iceberg/pull/7029#discussion_r1127449707


##########
core/src/main/java/org/apache/iceberg/io/DeleteSchemaUtil.java:
##########
@@ -29,7 +29,7 @@ private static Schema pathPosSchema(Schema rowSchema) {
     return new Schema(
         MetadataColumns.DELETE_FILE_PATH,
         MetadataColumns.DELETE_FILE_POS,
-        Types.NestedField.required(
+        Types.NestedField.optional(

Review Comment:
   Yea you are right, this is tricky. It says this in the spec:
   
   > | Field id, name | Type | Description |
   > | -- | -- | -- |
   > | 2147483544  `row` | `required struct<...>` [1] | Deleted row values. Omit the column when not storing deleted rows. |
   >
   > When present in the delete file, `row` is required because all delete entries must include the row values.
   
   So either the entire position delete file has `row`, or the entire file does not have `row`. (Currently it seems Spark does not set `row` at all, ref: 
https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java#L436)
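   
   For reference, a minimal sketch of the two schema shapes this implies, mirroring the `DeleteSchemaUtil` code this diff touches (constants taken from `org.apache.iceberg.MetadataColumns`; the wrapper class is only for illustration):
   
   ```java
   import org.apache.iceberg.MetadataColumns;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.types.Types;
   
   // Illustration only, not part of the PR.
   class PosDeleteSchemas {
     // Shape 1: path/pos only, no deleted row values (what Spark currently writes).
     static Schema withoutRow() {
       return new Schema(MetadataColumns.DELETE_FILE_PATH, MetadataColumns.DELETE_FILE_POS);
     }
   
     // Shape 2: path/pos plus deleted row values; per the spec, when 'row' is present
     // it is required, so every delete entry in the file carries the row values.
     static Schema withRow(Schema rowSchema) {
       return new Schema(
           MetadataColumns.DELETE_FILE_PATH,
           MetadataColumns.DELETE_FILE_POS,
           Types.NestedField.required(
               MetadataColumns.DELETE_FILE_ROW_FIELD_ID,
               MetadataColumns.DELETE_FILE_ROW_FIELD_NAME,
               rowSchema.asStruct(),
               MetadataColumns.DELETE_FILE_ROW_DOC));
     }
   }
   ```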
   
   When compacting delete files, I somehow need a way to know whether the original position delete files all have rows or not, and I am not sure at the moment how to determine this.
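   
   (If the schema actually stored in an existing delete file can be recovered when reading it, a check like the hypothetical helper below would distinguish the two cases; how to get that schema during compaction is exactly the open question here.)
   
   ```java
   import org.apache.iceberg.MetadataColumns;
   import org.apache.iceberg.Schema;
   
   // Hypothetical helper (not an existing API): assumes the schema stored in an
   // existing position delete file has been obtained somehow, e.g. from its footer.
   class DeleteFileRowCheck {
     static boolean storesDeletedRows(Schema deleteFileSchema) {
       // 'row' uses the reserved field id 2147483544 (MetadataColumns.DELETE_FILE_ROW_FIELD_ID),
       // so its presence in the stored schema means the file keeps deleted row values.
       return deleteFileSchema.findField(MetadataColumns.DELETE_FILE_ROW_FIELD_ID) != null;
     }
   }
   ```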
   
   
   


