[GitHub] [iceberg] Fokko commented on a diff in pull request #8672: Spec: Clarify missing fields when writing

via GitHub Thu, 28 Sep 2023 04:30:26 -0700


Fokko commented on code in PR #8672:
URL: https://github.com/apache/iceberg/pull/8672#discussion_r1339973796



##########
format/spec.md:
##########
@@ -128,13 +128,13 @@ Tables do not require rename, except for tables that use 
atomic rename to implem
 
 #### Writer requirements
 
-Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata files 
to a table with the given version.
+Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding 
metadata/manifest files to a table with the given version.
 
-| Requirement | Write behavior |
-|-------------|----------------|
-| (blank)     | The field should be omitted |
-| _optional_  | The field can be written |
-| _required_  | The field must be written |
+| Requirement | Write behavior                                        |
+|-------------|-------------------------------------------------------|
+| (blank)     | The field should not be present in the schema         |
+| _optional_  | The field should be in the schema, and can be written |
+| _required_  | The field should in the schema and must be written    |
 
 Readers should be more permissive because v1 metadata files are allowed in v2 
tables so that tables can be upgraded to v2 without rewriting the metadata 
tree. For manifest list and manifest files, this table shows the expected v2 
read behavior:

Review Comment:
   > Before this gets merged we should get an understanding of [the issue I 
raised in 
slack](https://apache-iceberg.slack.com/archives/C03LG1D563F/p1695897384506619?thread_ts=1695834739.711569&cid=C03LG1D563F).
   
   There is no rush in merging this asap, the point is to get some clarity from 
the community and try to clarify the spec so that the engines can follow it.
   
   > I have a manifest file, that has no column for sort_order_id, even though 
both v1 and v2 define sort order id to be optional
   
   I just checked locally with Spark 3.3.1 and Iceberg 1.3.1 and the 
`sort_order_id` field is there 🤔 
   
   **For writing**: We can update the spec to avoid misinterpretation. Also, 
for implementations out there, we can fix this and point to the spec to point 
out that the behavior is incorrect.
   
   **For reading**: If this is the case, our reader should always take this 
situation into account that an optional column can be missing. So your reader 
should be able to handle this (and this is the case for most Avro readers) As 
mentioned on Slack, if you use the Iceberg reader itself, then this behavior is 
built in. In Iceberg you can write data, update the schema, and write more data 
with an optional column. The old data doesn't get rewritten, so you have to 
take care of this in your read schema anyway. 
   
   > We leave it like that, but change the reader section below to state that 
readers should be prepared that fields declared optional may also be completely 
absent in Avro, due to some writer implementations not conforming to the spec. 
In this case, the reader should treat the field as if it was present and 
declared optional and all values were null.
   
   I think this is the correct path.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on a diff in pull request #8672: Spec: Clarify missing fields when writing

Reply via email to