[GitHub] [iceberg] JFinis commented on a diff in pull request #8672: Spec: Carify missing fields when writing

via GitHub Thu, 28 Sep 2023 03:44:43 -0700


JFinis commented on code in PR #8672:
URL: https://github.com/apache/iceberg/pull/8672#discussion_r1339915024



##########
format/spec.md:
##########
@@ -128,13 +128,13 @@ Tables do not require rename, except for tables that use 
atomic rename to implem
 
 #### Writer requirements
 
-Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata files 
to a table with the given version.
+Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding 
metadata/manifest files to a table with the given version.
 
-| Requirement | Write behavior |
-|-------------|----------------|
-| (blank)     | The field should be omitted |
-| _optional_  | The field can be written |
-| _required_  | The field must be written |
+| Requirement | Write behavior                                        |
+|-------------|-------------------------------------------------------|
+| (blank)     | The field should not be present in the schema         |
+| _optional_  | The field should be in the schema, and can be written |
+| _required_  | The field should in the schema and must be written    |
 
 Readers should be more permissive because v1 metadata files are allowed in v2 
tables so that tables can be upgraded to v2 without rewriting the metadata 
tree. For manifest list and manifest files, this table shows the expected v2 
read behavior:

Review Comment:
   Before this get's merged we should get an understanding of [the issue I 
raised in 
slack](https://apache-iceberg.slack.com/archives/C03LG1D563F/p1695897384506619?thread_ts=1695834739.711569&cid=C03LG1D563F).
 For reference, here is the issue:
   
   > I have a manifest file, that has no column for `sort_order_id`, even 
though both v1 and v2 define sort order id to be optional
   
   With the clarification in this PR, this file definitly no longer is spec 
conforming. But I'm 99% sure Spark wrote it. So we should do one of the 
following:
   
   * Maybe it's a bug in the spec, and the field was missing in v1 or v2? Then 
we should update the spec.
   * Or implementations did (and maybe still do?) read "optional" not as "it 
has to be present, but should be flagged optional" but rather "it may or may 
not be present". In this case we have two options:
      * We change the wording in this PR and also allow the field to be absent 
from the Avro schema in the optional case.
      * We leave it like that, but change the reader section below to state 
that readers should be prepared that fields declared optional may also be 
completely absent in Avro, due to some writer implementations not conforming to 
the spec. In this case, the reader should treat the field as if it was present 
and declared optional and all values were null.
   > 



##########
format/spec.md:
##########
@@ -128,13 +128,13 @@ Tables do not require rename, except for tables that use 
atomic rename to implem
 
 #### Writer requirements
 
-Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata files 
to a table with the given version.
+Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding 
metadata/manifest files to a table with the given version.
 
-| Requirement | Write behavior |
-|-------------|----------------|
-| (blank)     | The field should be omitted |
-| _optional_  | The field can be written |
-| _required_  | The field must be written |
+| Requirement | Write behavior                                        |
+|-------------|-------------------------------------------------------|
+| (blank)     | The field should not be present in the schema         |
+| _optional_  | The field should be in the schema, and can be written |

Review Comment:
   I think we need to be even more verbose here. There are two different 
"storage formats", namely JSON (for metadata.json) and Avro (Manifest(List)). 
We should mentioned precisely what it means for both.
   
   For example, JSON doesn't have a schema, so the comment "should be in the 
schema" doesn't make sense for JSON. For Avro, it is unclear what it means if a 
field "can be written". It's not about writing fields here. It's about writing 
a value or null (or in iceberg speak: The field in the schema should be 
declared optional) and we should spell this out.
   
   E.g., something along the lines of this:
   
   ```
   | (blank)     | The field should not be present in JSON. It should not be 
present in the schema of Avro files.   |
   | _optional_  | The field may or may not be present in JSON. It should be 
present in the schema of Avro files and it should be declared optional. |
   | _required_  | The field should be present in JSON. It should be present in 
the schema of Avro files and it should be declared required.   |
   ```



##########
format/spec.md:
##########
@@ -128,13 +128,13 @@ Tables do not require rename, except for tables that use 
atomic rename to implem
 
 #### Writer requirements
 
-Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata files 
to a table with the given version.
+Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding 
metadata/manifest files to a table with the given version.

Review Comment:
   ```suggestion
   Some tables in this spec have columns that specify requirements for v1 and 
v2 tables. These requirements are intended for writers when adding metadata, 
manifest files, or manifest lists to a table with the given version.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

[GitHub] [iceberg] JFinis commented on a diff in pull request #8672: Spec: Carify missing fields when writing

Reply via email to