khuggins opened a new issue, #47558:
URL: https://github.com/apache/arrow/issues/47558

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   What I expected:
   Writing a DataFrame to disk with `DataFrame.to_parquet()` should preserve the type of my data.
   
   What happened:
   The type of my data changed.
   
   Long description:
   I'm storing an array of `pandas.Timestamp` objects in a column of a pandas DataFrame. I write this DataFrame to disk, read it back at a later date, and concatenate new rows to it. However, when I read the DataFrame back in via `read_parquet`, the elements of my array have changed type from `pandas.Timestamp` to `numpy.datetime64`.
   
   This makes it challenging to append to the dataframe on disk as I get this 
error:
   ```
   pyarrow.lib.ArrowInvalid: ('numpy.datetime64 scalars cannot be mixed with other Python scalar values currently', 'Conversion failed for column <column_name> with type object')
   ```
   
   The error is raised on the write that follows a read. Here's a minimal example:
   
   ```
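   from datetime import date, datetime

   import pandas as pd

   print(pd.__version__)

   # One row whose "datetime" cell holds a list containing a pandas.Timestamp.
   df = pd.DataFrame(
       [(1, "x", date.today(), [pd.to_datetime(datetime(2018, 1, 2, 18, 53))])],
       columns=["number", "string", "date", "datetime"],
   )
   print(df)
   print(type(df["datetime"][0][0]))  # pandas Timestamp

   # Round-trip through Parquet; the list elements come back as numpy.datetime64.
   df.to_parquet("/tmp/test_datetime.parquet")
   df2 = pd.read_parquet("/tmp/test_datetime.parquet")
   print(df2)
   print(type(df2["datetime"][0][0]))  # numpy.datetime64

   # Writing the concatenation of both frames raises ArrowInvalid.
   concat_df = pd.concat([df, df2])
   concat_df.to_parquet("/tmp/test_datetime_concat.parquet")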
   ```
   
   The output of this snippet is:
   
   ```
   2.2.3
      number string        date               datetime
   0       1      x  2025-09-13  [2018-01-02 18:53:00]
   <class 'pandas._libs.tslibs.timestamps.Timestamp'>
      number string        date                      datetime
   0       1      x  2025-09-13  [2018-01-02T18:53:00.000000]
   <class 'numpy.datetime64'>
   ```
   with the associated error:
   ```
   {
        "name": "ArrowInvalid",
        "message": "('numpy.datetime64 scalars cannot be mixed with other Python scalar values currently', 'Conversion failed for column datetime with type object')",
   }
   ```
   The pandas version is `2.2.3` and the pyarrow version is `18.1.0`. I'm not certain which library the problem is in, but because the error is raised on write, I'm starting here.
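
One possible stopgap, sketched under the assumption that the failure comes from a column whose cells mix lists of `pandas.Timestamp` with arrays of `numpy.datetime64`: convert every cell to a plain list of `pandas.Timestamp` before writing. The helper name `to_timestamp_list` is hypothetical, and the round-tripped cell is simulated with a NumPy array rather than read from Parquet. This has not been verified to make the write succeed on pyarrow `18.1.0`.

```python
import numpy as np
import pandas as pd


def to_timestamp_list(cell):
    """Hypothetical helper: return the cell's values as a list of pandas.Timestamp."""
    return [pd.Timestamp(value) for value in cell]


# A cell as originally constructed: a list holding a pandas.Timestamp.
original = [pd.Timestamp("2018-01-02 18:53:00")]
# A stand-in for a cell after the Parquet round trip: an array of numpy.datetime64.
round_tripped = np.array(["2018-01-02T18:53:00"], dtype="datetime64[us]")

df = pd.DataFrame({"datetime": [original, round_tripped]})

# After this step every cell has the same shape: a list of pandas.Timestamp.
df["datetime"] = df["datetime"].map(to_timestamp_list)
```

Applied to `concat_df["datetime"]` before `to_parquet`, this would give the writer a single element type per cell, which is the condition the error message complains about.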
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
