[I] [Spark] Identity partition on required column generates nullable partition tuple in manifest file [iceberg]

via GitHub Fri, 11 Oct 2024 05:42:47 -0700


mosenberg opened a new issue, #11300:
URL: https://github.com/apache/iceberg/issues/11300


   ### Apache Iceberg version
   
   None
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   The issue repros using the following SQL:
   ```sql
   CREATE TABLE iceberg.NullabilityPartition(
     group STRING NOT NULL,
     val INTEGER
   )
   PARTITIONED BY (group)
   TBLPROPERTIES(`format-version`=2,
                 `write.parquet.compression-codec`='snappy',
                  write.delete.format.default='parquet',
                  write.delete.mode='merge-on-read',
                  write.update.mode='merge-on-read',
                  write.merge.mode='merge-on-read');
   
   INSERT INTO iceberg.NullabilityPartition SELECT * FROM VALUES ('foo',1),('foo',2);
   ```
   
   Per the SQL above, the column `group` is declared `NOT NULL`, i.e. it is a 
`required` field in the Iceberg table schema. However, in the generated Avro 
manifest file, the partition tuple field that holds the value of `group` (the 
column by which the table is identity-partitioned) is written with the Avro 
union type `["null", "string"]`, i.e. as a nullable field.
   
   As per my understanding of the Iceberg spec, this is not correct:
   The result type of an identity [partition 
transform](https://iceberg.apache.org/spec/#partition-transforms) is the 
source type, in this case `STRING NOT NULL`.
   The section on [manifest files](https://iceberg.apache.org/spec/#manifests) 
further states:
   > Partition data tuple, schema based on the partition spec output
   > using partition field ids for the struct field ids
   
   Hence the schema of the partition tuple should be `"string"` and not 
`["null","string"]`.
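   
   To make the difference concrete, here is a minimal sketch in Python (standard library only) that compares the observed and the expected Avro field definition for the partition tuple. The field name, the `field-id` value 1000, the `default` entry and the helper names `is_nullable` and `matches_source_nullability` are assumptions made for illustration; they are not copied from an actual manifest or from the Iceberg code base.

```python
import json

# Illustrative Avro field definitions for the partition tuple.
# The field-id (1000) and the "default" entry are assumptions, not values
# read from a real manifest file.
OBSERVED = json.loads(
    '{"name": "group", "type": ["null", "string"], "default": null, "field-id": 1000}'
)
EXPECTED = json.loads(
    '{"name": "group", "type": "string", "field-id": 1000}'
)


def is_nullable(avro_field: dict) -> bool:
    """Return True if the Avro field type is a union that includes "null"."""
    field_type = avro_field["type"]
    return isinstance(field_type, list) and "null" in field_type


def matches_source_nullability(avro_field: dict, source_required: bool) -> bool:
    """Check the reading of the spec argued above: for an identity transform,
    the partition field is nullable exactly when the source column is optional."""
    return is_nullable(avro_field) != source_required


if __name__ == "__main__":
    # `group` is declared NOT NULL, so the source column is required.
    for label, field in (("observed", OBSERVED), ("expected", EXPECTED)):
        ok = matches_source_nullability(field, source_required=True)
        print(f"{label}: type={json.dumps(field['type'])} consistent={ok}")
```

   Under this reading, only the `"string"` variant is consistent with a required source column.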
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [X] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


