syun64 commented on code in PR #848:
URL: https://github.com/apache/iceberg-python/pull/848#discussion_r1667216339
##########
pyiceberg/io/pyarrow.py:
##########
@@ -918,11 +919,24 @@ def primitive(self, primitive: pa.DataType) ->
PrimitiveType:
return TimeType()
elif pa.types.is_timestamp(primitive):
primitive = cast(pa.TimestampType, primitive)
- if primitive.unit == "us":
- if primitive.tz == "UTC" or primitive.tz == "+00:00":
- return TimestamptzType()
- elif primitive.tz is None:
- return TimestampType()
+ if primitive.unit in ("s", "ms", "us"):
+ # Supported types, will be upcast automatically to 'us'
+ pass
+ elif primitive.unit == "ns":
+ if Config().get_bool("downcast-ns-timestamp-on-write"):
Review Comment:
> I also got some issues with the nanosecond timestamp when collecting
statistics:
>
> ```
> > ???
> E ValueError: Nanosecond resolution temporal type 1615967687249846175 is
not safely convertible to microseconds to convert to datetime.datetime. Install
pandas to return as Timestamp with nanosecond support or access the .value
attribute.
> ```
>
> At the lines:
>
>
https://github.com/apache/iceberg-python/blob/7afd6d677d0fbc73d6baa8a5b57bdd889d0afe87/pyiceberg/io/pyarrow.py#L1870-L1871
>
> This got fixed after updating this to:
>
> ```python
> col_aggs[field_id].update_min(statistics.min_raw)
> col_aggs[field_id].update_max(statistics.max_raw)
> ```
I tried making this change and realized that this causes our serialization
to break because it introduces `bytes` values in our
[statistics](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.Statistics.html#pyarrow.parquet.Statistics.min_raw),
which cannot be serialized (since it already is). I will need to spend a bit
more time to figure out the right change to StatsAggregator to support this
change. I also failed to reproduce this issue in my environment (possibly
because it has pandas installed) so I'm reverting this change for now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]