davlee1972 opened a new issue, #45936:
URL: https://github.com/apache/arrow/issues/45936

   ### Describe the enhancement requested
   
   This impacts pyarrow.compute.cast() and reading pyarrow datasets with explicit schemas for text (CSV) and JSON files.
   
   It also impacts SQL result sets that return string values for date/datetime columns (tested using ADBC).
   
   Can we add string to date32 and date64 conversions that strip the time portion from YYYY-MM-DD HH:MM:SS.ffff?
   
   For JSON, YYYY-MM-DD can only be converted to timestamp[s]. This is inconsistent with the CSV reader, which by default converts YYYY-MM-DD into date32.
   
   ```
   >>> import pyarrow as pa
   >>> import pyarrow.compute as pc
   >>> import pyarrow.dataset as ds
   >>>
   >>> # This works
   >>> today = pa.scalar('2025-03-24')
   >>> pc.cast(today, "date32")
   <pyarrow.Date32Scalar: datetime.date(2025, 3, 24)>
   >>> pc.cast(today, "timestamp[s]").cast("date32")
   <pyarrow.Date32Scalar: datetime.date(2025, 3, 24)>
   >>>
   >>> # A direct cast fails; it only works if you cast to timestamp first
   >>> today = pa.scalar('2025-03-24 00:00:00')
   >>> pc.cast(today, "date32")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/u1/leed/miniconda3/lib/python3.9/site-packages/pyarrow/compute.py", line 405, in cast
       return call_function("cast", [arr], options, memory_pool)
     File "pyarrow/_compute.pyx", line 598, in pyarrow._compute.call_function
     File "pyarrow/_compute.pyx", line 393, in pyarrow._compute.Function.call
     File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Failed to parse string: '2025-03-24 00:00:00' as a scalar of type date32[day]
   >>> pc.cast(today, "timestamp[s]").cast("date32")
   <pyarrow.Date32Scalar: datetime.date(2025, 3, 24)>
   >>>
   >>> # For text you also can't parse dates with 00:00:00 into dates
   >>> with open("test.csv", "w") as f:
   ...     f.write("today\n")
   ...     f.write("2025-03-24 00:00:00\n")
   ...     f.write("2025-03-24 00:00:00\n")
   ...     f.write("2025-03-24 00:00:00\n")
   ...
   6
   20
   20
   20
   >>> text_dataset = ds.dataset("test.csv", format="csv", schema=pa.schema([pa.field("today", "date32")]))
   >>> text_dataset.schema
   today: date32[day]
   >>> text_dataset.head(10)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/_dataset.pyx", line 730, in pyarrow._dataset.Dataset.head
     File "pyarrow/_dataset.pyx", line 3911, in pyarrow._dataset.Scanner.head
     File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Could not open CSV input source 'test.csv': Invalid: In CSV column #0: Row #2: CSV conversion error to date32[day]: invalid value '2025-03-24 00:00:00'
   >>>
   >>> # JSON only supports timestamps.
   >>> with open("test.json", "w") as f:
   ...     f.write('{"today": "2025-03-24"}\n')
   ...     f.write('{"today": "2025-03-24"}\n')
   ...     f.write('{"today": "2025-03-24"}\n')
   ...
   24
   24
   24
   >>> json_dataset = ds.dataset("test.json", format="json", schema=pa.schema([pa.field("today", "date32")]))
   >>> json_dataset.schema
   today: date32[day]
   >>> json_dataset.head(10)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/_dataset.pyx", line 730, in pyarrow._dataset.Dataset.head
     File "pyarrow/_dataset.pyx", line 3911, in pyarrow._dataset.Scanner.head
     File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Could not open JSON input source 'test.json': Invalid: JSON parse error: Column(/today) changed from number to string in row 0
   >>>
   >>> json_dataset = ds.dataset("test.json", format="json")
   >>> json_dataset.schema
   today: timestamp[s]
   >>> json_dataset.head(10)
   pyarrow.Table
   today: timestamp[s]
   ----
   today: [[2025-03-24 00:00:00,2025-03-24 00:00:00,2025-03-24 00:00:00]]
   
   ```
   
   ### Component(s)
   
   C++, Python

