HonahX commented on code in PR #848:
URL: https://github.com/apache/iceberg-python/pull/848#discussion_r1665113108
##########
pyiceberg/io/pyarrow.py:
##########

```diff
@@ -918,11 +919,24 @@ def primitive(self, primitive: pa.DataType) -> PrimitiveType:
                 return TimeType()
         elif pa.types.is_timestamp(primitive):
             primitive = cast(pa.TimestampType, primitive)
-            if primitive.unit == "us":
-                if primitive.tz == "UTC" or primitive.tz == "+00:00":
-                    return TimestamptzType()
-                elif primitive.tz is None:
-                    return TimestampType()
+            if primitive.unit in ("s", "ms", "us"):
+                # Supported types, will be upcast automatically to 'us'
+                pass
+            elif primitive.unit == "ns":
+                if Config().get_bool("downcast-ns-timestamp-on-write"):
```

Review Comment:
Thanks for all the valuable discussion. Sorry about the typo in my very first comment; it should be `pyarrow_to_schema`. Apologies if it caused any confusion.

> I actually don't think it'll stop when it reads through to_requested_schema, because it will detect that the pyarrow types are different, but their IcebergTypes are the same and silently cast on read, which will drop the precision silently

I got the type conversion error at https://github.com/apache/iceberg-python/blob/b8c5bb77c5ea436aeced17676aa30d09c1224ed9/pyiceberg/io/pyarrow.py#L1278 if the timestamp value's nanosecond part is non-zero.

<details>
<summary>Example code that reproduces the issue (modified from a test in `test_add_files.py`):</summary>

```python
import os
from datetime import date

import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from pytest_mock import MockerFixture

from pyiceberg.catalog import Catalog

# `_create_table` and the `session_catalog` / `format_version` fixtures come from
# `tests/integration/test_add_files.py`.

ARROW_SCHEMA = pa.schema([
    ("foo", pa.bool_()),
    ("bar", pa.string()),
    ("baz", pa.int32()),
    ("qux", pa.date32()),
    ("quux", pa.timestamp("ns", tz="UTC")),
])

ARROW_TABLE = pa.Table.from_pylist(
    [
        {
            "foo": True,
            "bar": "bar_string",
            "baz": 123,
            "qux": date(2024, 3, 7),
            "quux": 1615967687249846175,  # 2021-03-17 07:54:47.249846175
        }
    ],
    schema=ARROW_SCHEMA,
)


@pytest.mark.integration
def test_timestamp_tz(session_catalog: Catalog, format_version: int, mocker: MockerFixture) -> None:
    mocker.patch.dict(os.environ, values={"PYICEBERG_DOWNCAST_NS_TIMESTAMP_ON_WRITE": "True"})

    identifier = f"default.unpartitioned_raises_not_found_v{format_version}"
    tbl = _create_table(session_catalog, identifier, format_version)

    file_paths = [
        f"s3://warehouse/default/unpartitioned_raises_not_found/v{format_version}/test-{i}.parquet"
        for i in range(5)
    ]
    # write parquet files
    for file_path in file_paths:
        fo = tbl.io.new_output(file_path)
        with fo.create(overwrite=True) as fos:
            with pq.ParquetWriter(fos, schema=ARROW_SCHEMA) as writer:
                writer.write_table(ARROW_TABLE)

    # add the parquet files as data files
    tbl.add_files(file_paths=file_paths)

    print(tbl.scan().to_arrow())
```

</details>

This is just an edge case that might be better resolved by a check in `add_files`; I only wanted to use it as an example of how the current API behaves on the read side.

I am also +1 on having the option to downcast `ns` to `us`. This could also be a possible solution for the ORC format support issue: https://github.com/apache/iceberg-python/pull/790#discussion_r1632797941
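For anyone following the thread, here is a minimal standalone PyArrow sketch (not part of this PR, and independent of the PyIceberg APIs) of the precision loss being discussed: a safe cast from `ns` to `us` raises as soon as the sub-microsecond component is non-zero, while an unsafe cast silently truncates it, which is the trade-off the proposed `downcast-ns-timestamp-on-write` / `PYICEBERG_DOWNCAST_NS_TIMESTAMP_ON_WRITE` option would opt into.

```python
import pyarrow as pa

# A ns-precision timestamp whose sub-microsecond component is non-zero,
# using the same value as the reproduction above.
ns_array = pa.array([1615967687249846175], type=pa.timestamp("ns", tz="UTC"))

# With the default safe cast, Arrow refuses to drop the trailing nanoseconds
# and raises ArrowInvalid ("... would lose data").
try:
    ns_array.cast(pa.timestamp("us", tz="UTC"))
except pa.ArrowInvalid as e:
    print(e)

# An unsafe cast is roughly what an opt-in ns -> us downcast amounts to:
# the value is truncated to microsecond precision.
us_array = ns_array.cast(pa.timestamp("us", tz="UTC"), safe=False)
print(ns_array.cast(pa.int64())[0])  # 1615967687249846175 (nanoseconds)
print(us_array.cast(pa.int64())[0])  # 1615967687249846    (microseconds; the trailing 175 ns are gone)
```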