Re: [PR] Add support for orc format [iceberg-python]

via GitHub Tue, 11 Jun 2024 22:43:40 -0700


HonahX commented on code in PR #790:
URL: https://github.com/apache/iceberg-python/pull/790#discussion_r1632769062



##########
pyiceberg/io/pyarrow.py:
##########
@@ -799,11 +802,12 @@ def primitive(self, primitive: pa.DataType) -> T:
 
 
 def _get_field_id(field: pa.Field) -> Optional[int]:
-    return (
-        int(field_id_str.decode())
-        if (field.metadata and (field_id_str := 
field.metadata.get(PYARROW_PARQUET_FIELD_ID_KEY)))
-        else None
-    )
+    if field.metadata and (field_id_str := 
field.metadata.get(PYARROW_ORC_FIELD_ID_KEY)):

Review Comment:
   We may want to add a doc somewhere to mention that ORC read support requires 
pyarrow version >= `13.0.0`, since the orc metadata is exposed in pyarrow 
schema in 13.0.0 https://github.com/apache/arrow/issues/35304
   
   Also, shall we check `PYARROW_PARQUET_FIELD_ID_KEY` first since parquet is 
the default file format?



##########
pyiceberg/io/pyarrow.py:
##########
@@ -912,6 +916,9 @@ def primitive(self, primitive: pa.DataType) -> 
PrimitiveType:
                     return TimestamptzType()
                 elif primitive.tz is None:
                     return TimestampType()
+            if primitive.unit == "ns":
+                if primitive.tz == "UTC":
+                    return TimestamptzType()

Review Comment:
   nanosecond timestamp is added in [version 
3](https://iceberg.apache.org/spec/#version-3), which is still under 
development and not formally adopted. Pyiceberg does not support nanosecond yet 
so I think we should not add the conversion here. (and it should be converted 
to a separate `TimestampNanoType` in the future: 
https://iceberg.apache.org/spec/#primitive-types )
   
   I think you add this because [arrow reads ORC timestamp as 
nanoseconds](https://arrow.apache.org/docs/cpp/orc.html) since ORC's timestamp 
types always contain nanosecond information. 
   
   It seems Java side currently just [treat ORC's timestamp type as `us` unit 
one](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/orc/src/main/java/org/apache/iceberg/orc/OrcToIcebergVisitor.java#L170-L177)
   
   We could probably fix this at arrow schema level. For example, we can add an 
additional conversion for `physical_schema` here to change the unit of arrow 
timestamp from `ns` to `us`:
   
https://github.com/apache/iceberg-python/blob/f782c548df31386ba19a15a33ee53a5094886506/pyiceberg/io/pyarrow.py#L987-L989
   Changing the physical schema also ensures that the actual timestamp data is 
read with `us` unit as required by `TimestampType`
   
https://github.com/apache/iceberg-python/blob/f782c548df31386ba19a15a33ee53a5094886506/pyiceberg/io/pyarrow.py#L1002-L1008
   
   However, I also feel it is not the ultimate solution because we assume the 
unit is microsecond. When `TimestampNanoType` is in, we may need to do some 
additional steps to ensure we reads the data using the correct unit.
   
   @MehulBatra @Fokko Would love to hear your thoughts on this. Please correct 
me if I make any mistakes about the ORC's behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Add support for orc format [iceberg-python]

Reply via email to