davlee1972 opened a new issue, #45936: URL: https://github.com/apache/arrow/issues/45936
### Describe the enhancement requested

This affects pyarrow.compute.cast() and reading pyarrow datasets with an explicit schema from text (CSV) and JSON files. It also affects SQL result sets that return string values for date/datetime columns (tested using ADBC).

Can we add string to date32 and date64 conversions that strip the time portion from YYYY-MM-DD HH:MM:SS.ffff?

For JSON, YYYY-MM-DD can only be converted to timestamp[s]. This is inconsistent with the CSV reader, which by default converts YYYY-MM-DD into date32.

```
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pyarrow.dataset as ds
>>>
>>> # This works
>>> today = pa.scalar('2025-03-24')
>>> pc.cast(today, "date32")
<pyarrow.Date32Scalar: datetime.date(2025, 3, 24)>
>>> pc.cast(today, "timestamp[s]").cast("date32")
<pyarrow.Date32Scalar: datetime.date(2025, 3, 24)>
>>>
>>> # This only works if you cast to timestamp first
>>> today = pa.scalar('2025-03-24 00:00:00')
>>> pc.cast(today, "date32")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/u1/leed/miniconda3/lib/python3.9/site-packages/pyarrow/compute.py", line 405, in cast
    return call_function("cast", [arr], options, memory_pool)
  File "pyarrow/_compute.pyx", line 598, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 393, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Failed to parse string: '2025-03-24 00:00:00' as a scalar of type date32[day]
>>> pc.cast(today, "timestamp[s]").cast("date32")
<pyarrow.Date32Scalar: datetime.date(2025, 3, 24)>
>>>
>>> # For text you also can't parse dates with 00:00:00 into dates
>>> with open("test.csv", "w") as f:
...     f.write("today\n")
...     f.write("2025-03-24 00:00:00\n")
...     f.write("2025-03-24 00:00:00\n")
...     f.write("2025-03-24 00:00:00\n")
...
6
20
20
20
>>> text_dataset = ds.dataset("test.csv", format="csv", schema=pa.schema([pa.field("today", "date32")]))
>>> text_dataset.schema
today: date32[day]
>>> text_dataset.head(10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 730, in pyarrow._dataset.Dataset.head
  File "pyarrow/_dataset.pyx", line 3911, in pyarrow._dataset.Scanner.head
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open CSV input source 'test.csv': Invalid: In CSV column #0: Row #2: CSV conversion error to date32[day]: invalid value '2025-03-24 00:00:00'
>>>
>>> # JSON only supports timestamps.
>>> with open("test.json", "w") as f:
...     f.write('{"today": "2025-03-24"}\n')
...     f.write('{"today": "2025-03-24"}\n')
...     f.write('{"today": "2025-03-24"}\n')
...
24
24
24
>>> json_dataset = ds.dataset("test.json", format="json", schema=pa.schema([pa.field("today", "date32")]))
>>> json_dataset.schema
today: date32[day]
>>> json_dataset.head(10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 730, in pyarrow._dataset.Dataset.head
  File "pyarrow/_dataset.pyx", line 3911, in pyarrow._dataset.Scanner.head
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open JSON input source 'test.json': Invalid: JSON parse error: Column(/today) changed from number to string in row 0
>>>
>>> json_dataset = ds.dataset("test.json", format="json")
>>> json_dataset.schema
today: timestamp[s]
>>> json_dataset.head(10)
pyarrow.Table
today: timestamp[s]
----
today: [[2025-03-24 00:00:00,2025-03-24 00:00:00,2025-03-24 00:00:00]]
```

### Component(s)

C++, Python
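To make the requested semantics concrete, here is a minimal pure-Python sketch of a string-to-date conversion that drops the time portion. It is only an illustration: `parse_date_dropping_time` is a hypothetical helper, not an existing pyarrow function, and the set of accepted formats (a space or `T` separator, optional fractional seconds of any length) is an assumption rather than something Arrow defines.

```python
import re
from datetime import date

# Hypothetical pattern (an assumption, not Arrow's parser):
# YYYY-MM-DD, optionally followed by " HH:MM:SS" or "THH:MM:SS"
# and optional fractional seconds.
_DATE_WITH_OPTIONAL_TIME = re.compile(
    r"(\d{4}-\d{2}-\d{2})(?:[ T]\d{2}:\d{2}:\d{2}(?:\.\d+)?)?"
)


def parse_date_dropping_time(text: str) -> date:
    """Parse a date or datetime string and return only the date part."""
    match = _DATE_WITH_OPTIONAL_TIME.fullmatch(text)
    if match is None:
        raise ValueError(f"not a date or datetime string: {text!r}")
    # Only the leading YYYY-MM-DD group is converted; the time is discarded.
    return date.fromisoformat(match.group(1))
```

As a stopgap, string columns could be pre-processed with a helper like this and the resulting `datetime.date` objects passed to `pa.array(..., type=pa.date32())`. The cast through `timestamp[s]` shown in the session above avoids the Python-level loop and is likely the better workaround for large data.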