Re: [PR] Spec: Clarify identity partition edge cases. [iceberg]

via GitHub Sun, 04 Aug 2024 12:19:41 -0700


rdblue commented on code in PR #10835:
URL: https://github.com/apache/iceberg/pull/10835#discussion_r1703358973



##########
format/spec.md:
##########
@@ -241,7 +245,14 @@ Struct evolution requires the following rules for default 
values:
 
 #### Column Projection
 
-Columns in Iceberg data files are selected by field id. The table schema's 
column names and order may change after a data file is written, and projection 
must be done using field ids. If a field id is missing from a data file, its 
value for each row should be `null`.
+Columns in Iceberg data files are selected by field id. The table schema's 
column names and order may change after a data file is written, and projection 
must be done using field ids.
+
+Values for field ids which are not present in a data file must be resolved 
according the following rules:
+
+* Return the value from partition metadata if an [Identity 
Transform](#partition-transforms) exists for the field and the partition value 
is present in the `partition` struct on `data_file` object in the manifest. 

Review Comment:
   @findepi, the spec originally stated that all columns that were not present 
in a file must be interpreted as `null` because the file is the source of truth 
for row data. If a column was not present, then it must not have existed at 
the time the file was written, so it must default to `null`: that was the 
only possible default for new columns.
   
   Identity partition columns were therefore always required to be written to 
data files, because omitting them would create a conflict between the partition 
value in metadata and the source of truth in the file. Now, the requirement to 
write all columns is more explicitly stated above.
   
   Over time, we added support for reading Hive files that were missing the 
identity-partitioned columns. This PR now documents how to do that in the read 
path, but we didn't drop the requirement to write all columns in the write path.
   
   In addition, we've now added column defaults for v3 and didn't realize that 
this section conflicted with the `initial-default` behavior.
   
   This change just states the expected behavior more clearly, and I think 
that makes it a good change.
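   
   As a rough illustration of the read path described above, here is a minimal 
sketch of resolving a value for a field id that may be missing from a data 
file. The function name, the dict-based inputs, and the exact ordering of the 
fallbacks are assumptions made for illustration; they are not taken from the 
spec text or from any Iceberg implementation.

```python
from typing import Any, Dict, Optional


def resolve_field_value(
    field_id: int,
    file_row: Dict[int, Any],
    identity_partition: Dict[int, Any],
    initial_defaults: Dict[int, Any],
) -> Optional[Any]:
    """Resolve one field of one row by field id (hypothetical helper).

    file_row:           field id -> value, for columns present in the data file
    identity_partition: source field id -> value from the manifest `partition`
                        struct, for identity-transformed fields only
    initial_defaults:   field id -> `initial-default`, where one is defined
    """
    # The data file is the source of truth whenever the column was written.
    if field_id in file_row:
        return file_row[field_id]
    # Column missing from the file: fall back to the identity partition value.
    if field_id in identity_partition:
        return identity_partition[field_id]
    # Assumed next fallback: the field's `initial-default` (v3 column defaults).
    if field_id in initial_defaults:
        return initial_defaults[field_id]
    # Otherwise the column did not exist when the file was written.
    return None


# Example: field 2 is an identity partition column omitted from a Hive file,
# field 3 was added later with an initial default, field 4 has no default.
row = {1: "a"}
partition = {2: "2024-08-04"}
defaults = {3: 0}
values = [resolve_field_value(i, row, partition, defaults) for i in (1, 2, 3, 4)]
```

   Membership checks are used instead of `dict.get` so that a `null` actually 
stored in the file is returned as-is rather than being replaced by a fallback.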



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

