Matthieusalor opened a new issue, #45062:
URL: https://github.com/apache/arrow/issues/45062

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When using `coerce_timestamps="us"`, the Parquet metadata is correctly set
   to datetime[us]; however, the stored binary Arrow schema still seems to be
   datetime[ns].
   
   ```python
   import pandas as pd
   import polars as pl
   from pyarrow.parquet import ParquetFile, ParquetWriter
   
   
   df = pd.DataFrame({"date": [pd.Timestamp.now()]})
   df.to_parquet("us.parquet", coerce_timestamps="us",
                 allow_truncated_timestamps=True, index=False)
   
   pqf = ParquetFile("us.parquet")
   writer = ParquetWriter("us_pyarrow.parquet",
                          schema=pqf.schema.to_arrow_schema())
   writer.write_table(pqf.read())
   writer.close()
   
   
   pl.read_parquet("us.parquet") # Gives datetime[ns]
   pl.read_parquet("us_pyarrow.parquet") # Gives datetime[us]
   
   ParquetFile("us.parquet").metadata.schema.to_arrow_schema()  # gives datetime[us]
   ParquetFile("us_pyarrow.parquet").metadata.schema.to_arrow_schema()  # gives datetime[us]
   ```
   
   Polars is probably relying on the binary Arrow schema stored in the
   metadata.
   Running the following prevents the mismatch when using Polars, and we do
   get datetime[us]:
   
   ```python
   df.to_parquet("us.parquet", coerce_timestamps="us",
                 allow_truncated_timestamps=True, index=False,
                 store_schema=False)
   ```
   
   The issue is that `store_schema` is not supported by the `write_to_dataset`
   function: the parameter is not available in `ParquetFileWriteOptions`, only
   in the `write_table` function and the `ParquetWriter` class.
   
   Therefore, running the following doesn't work.
   
   ```python
   df.to_parquet("us.parquet", coerce_timestamps="us",
                 allow_truncated_timestamps=True, index=False,
                 store_schema=False, partition_cols=[])
   ```
   
   I guess the stored binary Arrow schema should use datetime[us] instead of
   datetime[ns] when the `coerce_timestamps` parameter is used. In addition,
   the `store_schema` parameter should be made available to the
   `write_to_dataset` function.
   
   ### Component(s)
   
   Python

