jqin61 opened a new pull request, #453:
URL: https://github.com/apache/iceberg-python/pull/453

   **Scope**
   Add a `PartitionKey` class which (see the sketch after this list):
   1. holds the raw partition fields and values, for use in partitioned writes;
   2. converts the Python values into Iceberg-typed values;
   3. applies the transforms from the partition spec to the Iceberg-typed values;
   4. generates the partition path from the transformed values.
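
   Roughly, these responsibilities look like the following. This is a minimal sketch only; the class, field, and method names below are illustrative, not the PR's actual API:
   ```python
   from dataclasses import dataclass
   from typing import Any, Callable, List

   @dataclass
   class PartitionFieldValue:
       name: str                        # partition field name
       raw: Any                         # raw Python value from the row (step 1)
       transform: Callable[[Any], Any]  # e.g. identity, bucket[N], day

   @dataclass
   class PartitionKey:
       fields: List[PartitionFieldValue]

       def transformed_values(self) -> List[Any]:
           # Steps 2-3: convert each raw value and apply the spec's transform.
           return [f.transform(f.raw) for f in self.fields]

       def to_path(self) -> str:
           # Step 4: Hive-style 'name=value' segments joined by '/'.
           return "/".join(
               f"{f.name}={v}"
               for f, v in zip(self.fields, self.transformed_values())
           )

   key = PartitionKey([PartitionFieldValue("dt", "2023-01-01", lambda v: v)])
   print(key.to_path())  # dt=2023-01-01
   ```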
   
   **Tests**
   The object under test is `PartitionKey`. We verify that:
   1. `PartitionKey` generates the Hive partition part of the Parquet path the same way Spark does.
   2. `PartitionKey` forms the partition as a `Record` the same way Spark does. (The `Record` is used in metadata writing.)

   To compare against Spark, the expected path and expected partition are checked against two counterpart Spark SQL statements that create a partitioned table and insert data. A sketch of this test shape follows.
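
   For example (hypothetical helper and attribute names; `Record` is `pyiceberg.typedef.Record`, assumed here to accept keyword construction):
   ```python
   import pytest
   from pyiceberg.typedef import Record

   def build_partition_key(value):
       # Hypothetical stand-in for constructing a PartitionKey from a row value;
       # the real code builds it from the schema and partition spec.
       ...

   @pytest.mark.parametrize(
       "raw_value, expected_path, expected_partition",
       [
           # The expected values are read off a table written by Spark.
           (True, "bool_field=true", Record(bool_field=True)),
       ],
   )
   def test_partition_key_matches_spark(raw_value, expected_path, expected_partition):
       key = build_partition_key(raw_value)
       assert key.to_path() == expected_path       # Hive partition path segment
       assert key.partition == expected_partition  # Record used in metadata writing
   ```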
   
   During this verification, we found the following discrepancies between the underlying utility functions in PyIceberg and existing Spark behavior (illustrated in the sketches after this list):
   1. For a boolean-typed partition, Spark writes the Hive partition part of the path as `field=false`/`field=true`, while PyIceberg (with its current underlying utilities) writes `field=False`/`field=True`. The difference arises because Python's `str()` capitalizes booleans.
   2. Spark writes the path in URL-encoded form: in the value part after `field=`, any `=` is replaced by `%3D`, `:` by `%3A`, and so on. Shall we apply `urllib.parse.quote` to conform to Spark's behavior?
   3. For the timestamp(tz) type, Spark writes the Hive partition part of the path as `2023-01-01T12%3A00%3A01Z`, with `%3A` encoding `:` and the timestamp ending in `Z`, while the existing PyIceberg utilities use
   ```python
   (EPOCH_TIMESTAMP + timedelta(microseconds=timestamp_micros)).isoformat()
   ```
   which does not append the `Z`.
   4. For float and double: with Spark, a partitioned float field with the value 3.14 ends up in the manifest entry as `Record[double_field=3.140000104904175]`. So far PyIceberg writes it as `Record[double_field=3.14]`, which I think is better.
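
   To make discrepancies 1-3 concrete, here is a stdlib-only illustration of the Spark-side formatting (nothing here is PyIceberg's actual utility code):
   ```python
   from datetime import datetime, timedelta
   from urllib.parse import quote

   # 1. Python capitalizes booleans; Spark's path value is lowercase.
   assert str(True) == "True"          # current PyIceberg segment value
   assert str(True).lower() == "true"  # what Spark writes after 'field='

   # 2. Spark URL-encodes the value part: '=' -> '%3D', ':' -> '%3A', etc.
   assert quote("a=b:c") == "a%3Db%3Ac"

   # 3. A naive isoformat() has no trailing 'Z'; Spark appends it.
   EPOCH_TIMESTAMP = datetime(1970, 1, 1)
   ts = EPOCH_TIMESTAMP + timedelta(microseconds=1_672_574_401_000_000)
   assert ts.isoformat() == "2023-01-01T12:00:01"                    # no 'Z'
   assert quote(ts.isoformat() + "Z") == "2023-01-01T12%3A00%3A01Z"  # Spark-style
   ```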
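
   Discrepancy 4 is consistent with the float partition value being stored at single precision: widening the nearest 32-bit float to a double yields the long decimal. A quick check:
   ```python
   import struct

   # Round-trip 3.14 through a 32-bit float, then widen back to Python's
   # 64-bit float, the value that appears in the Spark-written manifest Record.
   widened = struct.unpack("<f", struct.pack("<f", 3.14))[0]
   print(widened)  # 3.140000104904175
   ```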
   
   For these discrepancies, should we conform to Spark's behavior?
   
   
   In terms of how `PartitionKey` is used, please check the PR for [partitioned write support](https://github.com/apache/iceberg-python/pull/353). I separated this PR out of the partitioned write PR to make the latter more manageable, but I am willing to combine the two if suggested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

