HonahX commented on code in PR #848:
URL: https://github.com/apache/iceberg-python/pull/848#discussion_r1665113108
##########
pyiceberg/io/pyarrow.py:
##########
@@ -918,11 +919,24 @@ def primitive(self, primitive: pa.DataType) -> PrimitiveType:
             return TimeType()
         elif pa.types.is_timestamp(primitive):
             primitive = cast(pa.TimestampType, primitive)
-            if primitive.unit == "us":
-                if primitive.tz == "UTC" or primitive.tz == "+00:00":
-                    return TimestamptzType()
-                elif primitive.tz is None:
-                    return TimestampType()
+            if primitive.unit in ("s", "ms", "us"):
+                # Supported types, will be upcast automatically to 'us'
+                pass
+            elif primitive.unit == "ns":
+                if Config().get_bool("downcast-ns-timestamp-on-write"):
Review Comment:
Thanks for all the valuable discussion.
Sorry about the typo in my very first comment; it should have been
`pyarrow_to_schema`. Apologies if it caused any confusion.
> I actually don't think it'll stop when it reads through
> to_requested_schema, because it will detect that the pyarrow types are
> different, but their IcebergTypes are the same and silently cast on read, which
> will drop the precision silently
I got a type conversion error at
https://github.com/apache/iceberg-python/blob/b8c5bb77c5ea436aeced17676aa30d09c1224ed9/pyiceberg/io/pyarrow.py#L1278
when the timestamp value has a nonzero nanosecond component.
<details>
<summary>Example code that reproduces the issue (modified from a test in
`test_add_files.py`):</summary>
```python
import os
from datetime import date

import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from pytest_mock.plugin import MockerFixture

from pyiceberg.catalog import Catalog

# `_create_table` is assumed to be the helper available in `test_add_files.py`.

ARROW_SCHEMA = pa.schema([
    ("foo", pa.bool_()),
    ("bar", pa.string()),
    ("baz", pa.int32()),
    ("qux", pa.date32()),
    ("quux", pa.timestamp("ns", tz="UTC")),
])

ARROW_TABLE = pa.Table.from_pylist(
    [
        {
            "foo": True,
            "bar": "bar_string",
            "baz": 123,
            "qux": date(2024, 3, 7),
            "quux": 1615967687249846175,  # 2021-03-17 07:54:47.249846175 UTC
        }
    ],
    schema=ARROW_SCHEMA,
)


@pytest.mark.integration
def test_timestamp_tz(
    session_catalog: Catalog, format_version: int, mocker: MockerFixture
) -> None:
    # Enable the ns -> us downcast introduced by this PR
    mocker.patch.dict(os.environ, values={"PYICEBERG_DOWNCAST_NS_TIMESTAMP_ON_WRITE": "True"})

    identifier = f"default.unpartitioned_raises_not_found_v{format_version}"
    tbl = _create_table(session_catalog, identifier, format_version)

    file_paths = [
        f"s3://warehouse/default/unpartitioned_raises_not_found/v{format_version}/test-{i}.parquet"
        for i in range(5)
    ]
    # write parquet files that keep the nanosecond timestamp column
    for file_path in file_paths:
        fo = tbl.io.new_output(file_path)
        with fo.create(overwrite=True) as fos:
            with pq.ParquetWriter(fos, schema=ARROW_SCHEMA) as writer:
                writer.write_table(ARROW_TABLE)

    # add the parquet files as data files
    tbl.add_files(file_paths=file_paths)

    # reading back triggers the type conversion error
    print(tbl.scan().to_arrow())
```
</details>
This is just an edge case that may be better resolved by having a check in
`add_files`. I just wanted to use it as an example to show the effect of the
current API on the read side.
I am also +1 on having the option to downcast ns to us. This could also be a
possible solution for the ORC format support issue:
https://github.com/apache/iceberg-python/pull/790#discussion_r1632797941