rudolfbyker opened a new issue, #45814:
URL: https://github.com/apache/arrow/issues/45814

   ### Describe the enhancement requested
   
   ## Status quo
   
   - [`ConvertOptions`](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions) has `column_types` and `timestamp_parsers`.
   - `column_types` pins a column to one exact type, but it does not allow 
limiting the inference to a subset of all available types (e.g., saying that 
column "A" may be string, int, or bool, but not date or time).
   - `timestamp_parsers` does not seem to allow disabling date detection (e.g. 
for a string like "2025-03-17"), and it's not configurable per column.
   
   ## Use case / application
   
   In the software I'm writing, we apply the following rules when reading CSV 
files:
   
   - The user may specify one or more data types that are allowed for a column. 
The type is inferred, but restricted to one of those types.
   - If the user does not specify the allowed data types for a column, the 
allowed types are string and float, since numeric columns are very common, and 
much easier to detect reliably than dates and/or times.
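
The rules above can be sketched in plain Python, independent of Arrow. Everything here is hypothetical and only illustrates the intended behaviour: the type names are plain strings, `infer_restricted` is an invented helper, and the checks rely on Python's `int()` and `float()`, which are more permissive than a real CSV reader would be.

```python
from typing import Callable, Dict, Iterable, Sequence


def _parses(parser: Callable[[str], object], text: str) -> bool:
    """Return True if `parser` accepts `text` without raising ValueError."""
    try:
        parser(text)
        return True
    except ValueError:
        return False


# Candidate checks, ordered from most specific to least specific.
# Hypothetical type names; a real implementation would use Arrow data types.
CHECKS: Dict[str, Callable[[str], bool]] = {
    "bool": lambda s: s.strip().lower() in {"true", "false"},
    "int": lambda s: _parses(int, s),
    "float": lambda s: _parses(float, s),
    "string": lambda s: True,
}

# Default from the rules above: only string and float are allowed.
DEFAULT_ALLOWED: Sequence[str] = ("float", "string")


def infer_restricted(
    values: Iterable[str], allowed: Sequence[str] = DEFAULT_ALLOWED
) -> str:
    """Pick the most specific allowed type that fits every value."""
    values = list(values)
    for name, check in CHECKS.items():
        if name in allowed and all(check(v) for v in values):
            return name
    raise ValueError(f"no type in {list(allowed)} fits all values")


print(infer_restricted(["2025-03-17", "2025-03-18"]))  # date-like strings
print(infer_restricted(["1", "2"], allowed=("int", "string")))
```

Because dates are never among the candidates unless the user asks for them, date-like strings fall through to `string`.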
   
   We often encounter data containing strings which look like dates, but aren't.
   
   ## Suggestions
   
   Some of these suggestions are complementary, and I would understand if you 
want to split them into separate issues:
   
   - Allow specifying a list of allowed data types per column, and a default 
list of allowed data types.
       - This could be done using a `defaultdict` in Python, OR two separate 
options in `ConvertOptions` where one is `Mapping[str, Sequence[DataType]]` 
(keyed by column name) and the other is `Sequence[DataType]`.
       - DuckDB has an option like this, called 
[`auto_type_candidates`](https://duckdb.org/docs/stable/data/csv/overview.html),
 but unfortunately, theirs is also not configurable per column.
       - What would be even more amazing is to map regular expressions (to be 
executed on the column names) to lists of allowed data types. (Example use 
case: All columns named `foo_\d+` should be limited to `float|int`.)
   - Make `timestamp_parsers` configurable per column, but keep a default 
global one that works for columns that are not specified explicitly.
   - Make something similar to `timestamp_parsers` that also works for dates.
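
To make the first suggestion more concrete, here is a rough sketch of how the allowed types for a column could be resolved from an exact-name mapping, a regex mapping, and a default list. None of this exists in `ConvertOptions`; the function name, the parameter names, and the precedence (exact name first, then patterns in insertion order, then the default) are assumptions, and plain strings stand in for `DataType` objects.

```python
import re
from typing import Mapping, Sequence


def allowed_types_for(
    column: str,
    by_name: Mapping[str, Sequence[str]],
    by_pattern: Mapping[str, Sequence[str]],
    default: Sequence[str],
) -> Sequence[str]:
    """Resolve the allowed types for one column (hypothetical helper)."""
    # 1. An exact column name wins.
    if column in by_name:
        return by_name[column]
    # 2. Otherwise the first pattern that matches the whole name.
    for pattern, types in by_pattern.items():
        if re.fullmatch(pattern, column):
            return types
    # 3. Otherwise the global default list.
    return default


by_name = {"A": ("string", "int", "bool")}
by_pattern = {r"foo_\d+": ("float", "int")}
default = ("string", "float")

for name in ("A", "foo_12", "foo_bar", "other"):
    print(name, allowed_types_for(name, by_name, by_pattern, default))
```

Using `re.fullmatch` rather than `re.search` means `foo_\d+` does not accidentally match a column such as `foo_12_notes`.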
   
   ### Component(s)
   
   Python


