HonahX commented on code in PR #790: URL: https://github.com/apache/iceberg-python/pull/790#discussion_r1632769062
########## pyiceberg/io/pyarrow.py: ########## @@ -799,11 +802,12 @@ def primitive(self, primitive: pa.DataType) -> T: def _get_field_id(field: pa.Field) -> Optional[int]: - return ( - int(field_id_str.decode()) - if (field.metadata and (field_id_str := field.metadata.get(PYARROW_PARQUET_FIELD_ID_KEY))) - else None - ) + if field.metadata and (field_id_str := field.metadata.get(PYARROW_ORC_FIELD_ID_KEY)): Review Comment: We may want to add a doc somewhere to mention that ORC read support requires pyarrow version >= `13.0.0`, since the orc metadata is exposed in pyarrow schema in 13.0.0 https://github.com/apache/arrow/issues/35304 Also, shall we check `PYARROW_PARQUET_FIELD_ID_KEY` first since parquet is the default file format? ########## pyiceberg/io/pyarrow.py: ########## @@ -912,6 +916,9 @@ def primitive(self, primitive: pa.DataType) -> PrimitiveType: return TimestamptzType() elif primitive.tz is None: return TimestampType() + if primitive.unit == "ns": + if primitive.tz == "UTC": + return TimestamptzType() Review Comment: nanosecond timestamp is added in [version 3](https://iceberg.apache.org/spec/#version-3), which is still under development and not formally adopted. Pyiceberg does not support nanosecond yet so I think we should not add the conversion here. (and it should be converted to a separate `TimestampNanoType` in the future: https://iceberg.apache.org/spec/#primitive-types ) I think you add this because [arrow reads ORC timestamp as nanoseconds](https://arrow.apache.org/docs/cpp/orc.html) since ORC's timestamp types always contain nanosecond information. It seems Java side currently just [treat ORC's timestamp type as `us` unit one](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/orc/src/main/java/org/apache/iceberg/orc/OrcToIcebergVisitor.java#L170-L177) We could probably fix this at arrow schema level. For example, we can add an additional conversion for `physical_schema` here to change the unit of arrow timestamp from `ns` to `us`: https://github.com/apache/iceberg-python/blob/f782c548df31386ba19a15a33ee53a5094886506/pyiceberg/io/pyarrow.py#L987-L989 Changing the physical schema also ensures that the actual timestamp data is read with `us` unit as required by `TimestampType` https://github.com/apache/iceberg-python/blob/f782c548df31386ba19a15a33ee53a5094886506/pyiceberg/io/pyarrow.py#L1002-L1008 However, I also feel it is not the ultimate solution because we assume the unit is microsecond. When `TimestampNanoType` is in, we may need to do some additional steps to ensure we reads the data using the correct unit. @MehulBatra @Fokko Would love to hear your thoughts on this. Please correct me if I make any mistakes about the ORC's behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org