[I] Disable or configure date and/or time inference when reading CSV files [arrow]

via GitHub Mon, 17 Mar 2025 03:47:39 -0700


rudolfbyker opened a new issue, #45814:
URL: https://github.com/apache/arrow/issues/45814

### Describe the enhancement requested

## Status quo

- [`ConvertOptions`](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions)
  has `column_types` and `timestamp_parsers`.
  - `column_types` lets you pin a column to one exact type, but it does not
    allow limiting the inference to a subset of all available types (e.g., saying
    that column "A" may be string, int, or bool, but not date or time).
  - `timestamp_parsers` does not seem to allow disabling date detection (e.g.,
    for a string like "2025-03-17"), and it is not configurable per column.

## Use case / application

In the software I'm writing, we apply the following rules when reading CSV
files:

- The user may specify one or more data types that are allowed for a column.
The type is inferred, but restricted to one of those types.
- If the user does not specify the allowed data types for a column, the
allowed types are string and float, since numeric columns are very common, and
much easier to detect reliably than dates and/or times.
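
The rules above could be sketched in plain Python roughly as follows. This is only an illustration of the intended behaviour, not how Arrow infers types: the type names are plain strings, and `CHECKS`, `DEFAULT_ALLOWED`, and `infer_type` are hypothetical names invented for this example.

```python
from typing import Callable, Dict, Iterable, Sequence


def _parses_as(cast: Callable[[str], object]) -> Callable[[str], bool]:
    """Build a predicate that reports whether cast(value) succeeds."""

    def check(value: str) -> bool:
        try:
            cast(value)
        except ValueError:
            return False
        return True

    return check


# Candidate types ordered from most to least specific; "string" accepts anything.
CHECKS: Dict[str, Callable[[str], bool]] = {
    "bool": lambda v: v.lower() in {"true", "false"},
    "int": _parses_as(int),
    "float": _parses_as(float),
    "string": lambda v: True,
}

# Default when the user specifies nothing for a column (per the rules above).
DEFAULT_ALLOWED = ("float", "string")


def infer_type(values: Iterable[str], allowed: Sequence[str] = DEFAULT_ALLOWED) -> str:
    """Return the most specific allowed type that fits every non-empty cell."""
    cells = [v for v in values if v != ""]  # treat empty cells as nulls
    for name, check in CHECKS.items():
        if name in allowed and all(check(v) for v in cells):
            return name
    raise ValueError(f"no allowed type in {list(allowed)} fits the column")


# Date-like strings fall back to "string" because no date type is allowed.
print(infer_type(["2025-03-17", "2025-03-18"]))
print(infer_type(["1", "2"], allowed=("int", "float", "string")))
```

Because date and time types never appear among the candidates unless the user asks for them, date-like strings are left untouched by default.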

We often encounter data containing strings which look like dates, but aren't.

## Suggestions

Some of these suggestions are complementary, and I would understand if you
want to split them into separate issues:

- Allow specifying a list of allowed data types per column, and a default
  list of allowed data types.
  - This could be done using a `defaultdict` in Python, OR two separate
    options in `ConvertOptions` where one is `Mapping[str, Sequence[DataType]]`
    (keyed by column name) and the other is `Sequence[DataType]`.
  - DuckDB has an option like this, called
    [`auto_type_candidates`](https://duckdb.org/docs/stable/data/csv/overview.html),
    but unfortunately, theirs is also not configurable per column.
  - What would be even more amazing is to map regular expressions (to be
    matched against the column names) to lists of allowed data types. (Example
    use case: All columns named `foo_\d+` should be limited to `float|int`.)
- Make `timestamp_parsers` configurable per column, but keep a default
  global one that applies to columns that are not specified explicitly.
- Add something similar to `timestamp_parsers` that also works for dates.
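
To make the first suggestion more concrete, here is a rough sketch of how the allowed types for a column could be resolved from an exact-name mapping, a regular-expression mapping, and a default list. None of this exists in `ConvertOptions`; the function `allowed_types_for` and its parameters are hypothetical, and type names are plain strings here where a real option would presumably take `DataType` objects.

```python
import re
from typing import Mapping, Sequence


def allowed_types_for(
    column: str,
    by_name: Mapping[str, Sequence[str]],
    by_pattern: Mapping[str, Sequence[str]],
    default: Sequence[str],
) -> Sequence[str]:
    """Resolve the candidate types for one column.

    Precedence: exact column name, then the first regular expression that
    matches the whole column name, then the default list.
    """
    if column in by_name:
        return by_name[column]
    for pattern, types in by_pattern.items():
        if re.fullmatch(pattern, column):
            return types
    return default


# Hypothetical configuration mirroring the example use case above.
by_name = {"id": ["int"]}
by_pattern = {r"foo_\d+": ["float", "int"]}
default = ["string", "float"]

for name in ["id", "foo_12", "foo_bar", "comment"]:
    print(name, allowed_types_for(name, by_name, by_pattern, default))
```

Using `re.fullmatch` rather than `re.search` is a deliberate choice in this sketch, so that `foo_\d+` does not accidentally match a column such as `my_foo_1_label`.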

### Component(s)

Python

