rudolfbyker opened a new issue, #45814: URL: https://github.com/apache/arrow/issues/45814
### Describe the enhancement requested

## Status quo

- [`ConvertOptions`](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions) has `column_types` and `timestamp_parsers`.
- `column_types` lets you pin a column to one specific type, but it cannot restrict inference to a subset of the available types (e.g., saying that column "A" may be string, int, or bool, but not date or time).
- `timestamp_parsers` does not seem to allow disabling date detection (e.g. for a string like "2025-03-17"), and it is not configurable per column.

## Use case / application

In the software I'm writing, we apply the following rules when reading CSV files:

- The user may specify one or more data types that are allowed for a column. The type is inferred, but restricted to one of those types.
- If the user does not specify the allowed data types for a column, the allowed types are string and float, since numeric columns are very common and much easier to detect reliably than dates and/or times.

We often encounter data containing strings that look like dates, but aren't.

## Suggestions

Some of these suggestions are complementary, and I would understand if you want to split them into separate issues:

- Allow specifying a list of allowed data types per column, and a default list of allowed data types.
  - This could be done using a `defaultdict` in Python, OR two separate options in `ConvertOptions`, where one is `Mapping[str, Sequence[DataType]]` (keyed by column name) and the other is `Sequence[DataType]`.
  - DuckDB has an option like this, called [`auto_type_candidates`](https://duckdb.org/docs/stable/data/csv/overview.html), but unfortunately, theirs is also not configurable per column.
  - What would be even more amazing is to map regular expressions (to be matched against the column names) to lists of allowed data types. (Example use case: All columns named `foo_\d+` should be limited to `float|int`.)
- Make `timestamp_parsers` configurable per column, but keep a global default that applies to columns that are not specified explicitly.
- Add something similar to `timestamp_parsers` that also works for dates.

### Component(s)

Python
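
To make the first suggestion more concrete, here is a minimal pure-Python sketch of the requested semantics: inference restricted to a per-column list of candidate types, with a default list and regex-keyed overrides. Nothing here is pyarrow API. The names `CHECKS`, `allowed_types`, and `infer_type` are hypothetical, the type names are plain strings standing in for `DataType` objects, and the parsing rules are deliberately simplistic.

```python
import re
from typing import Callable, Dict, Mapping, Optional, Sequence


def _parses(cast: Callable[[str], object]) -> Callable[[str], bool]:
    """Turn a casting function into a predicate that reports whether it succeeds."""
    def check(cell: str) -> bool:
        try:
            cast(cell)
            return True
        except ValueError:
            return False
    return check


# Hypothetical candidate types, ordered from most to least specific.
# "string" accepts every cell, so it acts as the fallback when allowed.
CHECKS: Dict[str, Callable[[str], bool]] = {
    "bool": lambda cell: cell.lower() in {"true", "false"},
    "int": _parses(int),
    "float": _parses(float),
    "string": lambda cell: True,
}


def allowed_types(
    column: str,
    per_column: Mapping[str, Sequence[str]],
    default: Sequence[str],
) -> Sequence[str]:
    """Return the candidates of the first regex that fully matches the column name."""
    for pattern, candidates in per_column.items():
        if re.fullmatch(pattern, column):
            return candidates
    return default


def infer_type(cells: Sequence[str], candidates: Sequence[str]) -> Optional[str]:
    """Return the most specific allowed type that every cell satisfies, else None."""
    for name, check in CHECKS.items():
        if name in candidates and all(check(cell) for cell in cells):
            return name
    return None


# Assumed configuration: regex-keyed overrides plus a default candidate list.
per_column = {r"foo_\d+": ["float", "int"]}
default = ["string", "float"]

columns = {
    "foo_1": ["1", "2", "3"],
    "foo_2": ["1.5", "2"],
    "released": ["2025-03-17", "2025-03-18"],
    "score": ["1", "2"],
}

schema = {
    name: infer_type(cells, allowed_types(name, per_column, default))
    for name, cells in columns.items()
}
print(schema)
```

With this configuration, `foo_1` is inferred as `int` and `foo_2` as `float`, while `released` stays a `string` because no date type is among its candidates. `score` becomes `float` rather than `int`, since `int` is not in the default list. A real implementation would presumably live in the CSV reader's inference step and would accept Arrow `DataType` objects instead of strings.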