Re: [PR] Cast 's', 'ms' and 'ns' PyArrow timestamp to 'us' precision on write [iceberg-python]

via GitHub Wed, 03 Jul 2024 21:48:26 -0700


HonahX commented on code in PR #848:
URL: https://github.com/apache/iceberg-python/pull/848#discussion_r1665113108



##########
pyiceberg/io/pyarrow.py:
##########
@@ -918,11 +919,24 @@ def primitive(self, primitive: pa.DataType) -> PrimitiveType:
             return TimeType()
         elif pa.types.is_timestamp(primitive):
             primitive = cast(pa.TimestampType, primitive)
-            if primitive.unit == "us":
-                if primitive.tz == "UTC" or primitive.tz == "+00:00":
-                    return TimestamptzType()
-                elif primitive.tz is None:
-                    return TimestampType()
+            if primitive.unit in ("s", "ms", "us"):
+                # Supported types, will be upcast automatically to 'us'
+                pass
+            elif primitive.unit == "ns":
+                if Config().get_bool("downcast-ns-timestamp-on-write"):

Review Comment:
   Thanks for all the valuable discussion. 
   
   Sorry about the typo in my very first comment: it should be 
`pyarrow_to_schema`. Apologies if it caused any confusion.
   
   > I actually don't think it'll stop when it reads through 
to_requested_schema, because it will detect that the pyarrow types are 
different, but their IcebergTypes are the same and silently cast on read, which 
will drop the precision silently
   
   I got a type conversion error at
   
https://github.com/apache/iceberg-python/blob/b8c5bb77c5ea436aeced17676aa30d09c1224ed9/pyiceberg/io/pyarrow.py#L1278
   when the timestamp value has a non-zero nanosecond part (digits below microsecond precision).
   
   
   <details>
   <summary>Example code that reproduces the issue (modified from a test in 
`test_add_files.py`):</summary>
   
   ```python
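   import os
   from datetime import date
   
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pytest
   from pytest_mock.plugin import MockerFixture
   
   from pyiceberg.catalog import Catalog
   
   # `_create_table` is assumed to be the helper defined in `test_add_files.py`.
   ARROW_SCHEMA = pa.schema([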
       ("foo", pa.bool_()),
       ("bar", pa.string()),
       ("baz", pa.int32()),
       ("qux", pa.date32()),
       ("quux", pa.timestamp("ns", tz="UTC")),
   ])
   
   ARROW_TABLE = pa.Table.from_pylist(
       [
           {
               "foo": True,
               "bar": "bar_string",
               "baz": 123,
               "qux": date(2024, 3, 7),
               "quux": 1615967687249846175,  # 2021-03-17 07:54:47.249846175 UTC
           }
       ],
       schema=ARROW_SCHEMA,
   )
   
   @pytest.mark.integration
   def test_timestamp_tz(
       session_catalog: Catalog, format_version: int, mocker: MockerFixture
   ) -> None:
       mocker.patch.dict(
           os.environ, values={"PYICEBERG_DOWNCAST_NS_TIMESTAMP_ON_WRITE": "True"}
       )
       identifier = f"default.unpartitioned_raises_not_found_v{format_version}"
       tbl = _create_table(session_catalog, identifier, format_version)
   
       file_paths = [
           f"s3://warehouse/default/unpartitioned_raises_not_found/v{format_version}/test-{i}.parquet"
           for i in range(5)
       ]
       # write parquet files
       for file_path in file_paths:
           fo = tbl.io.new_output(file_path)
           with fo.create(overwrite=True) as fos:
               with pq.ParquetWriter(fos, schema=ARROW_SCHEMA) as writer:
                   writer.write_table(ARROW_TABLE)
   
       # add the parquet files as data files
       tbl.add_files(file_paths=file_paths)
   
       print(tbl.scan().to_arrow())
   ```
   </details>
   
   This is just an edge case that may be better resolved by a check in 
`add_files`. I just wanted to use it as an example to show the effect of the 
current API on the read side. 
   
   I am also +1 on having the option to downcast ns to us. This could also be a 
possible solution for the ORC format support issue: 
https://github.com/apache/iceberg-python/pull/790#discussion_r1632797941
   
   
   
   
   


