dxdc opened a new issue, #47502:
URL: https://github.com/apache/arrow/issues/47502
### Describe the enhancement requested
## Motivation / use-case
Many users need to coerce **all** columns of a CSV to the same Arrow
type—most commonly `string()` to keep raw text—when the schema is unknown or
very wide.
Today the API only permits either:
* passing an explicit `column_types={"colA": pa.string(), …}` map, **or**
* letting the reader infer per-column types.
That forces callers to a) know every header in advance and b) enumerate
them, which is painful for dynamic files.
The limitation was raised in
[*ARROW-5811*](https://github.com/apache/arrow/issues/22232).
Current docs confirm no built-in way exists beyond the explicit map.
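
For reference, the workaround this forces today can be sketched as below. This is a minimal illustration, not part of the proposal. It assumes the header is the first row, the default delimiter, and no skipped rows. The helper names (`read_header`, `uniform_column_types`, `read_csv_uniform`) are hypothetical; only `ConvertOptions(column_types=...)` and `read_csv` are existing PyArrow calls. The file has to be opened twice, which is the friction described above.

```python
import csv


def read_header(path, delimiter=","):
    """Return the column names from the first row of the CSV at `path`."""
    # utf-8-sig drops a leading byte-order mark, if the file has one.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return next(csv.reader(f, delimiter=delimiter), [])


def uniform_column_types(header, default_type, overrides=None):
    """Map every name in `header` to `default_type`; `overrides` take precedence."""
    column_types = {name: default_type for name in header}
    column_types.update(overrides or {})
    return column_types


def read_csv_uniform(path, overrides=None):
    """Read `path` with every column as string unless listed in `overrides`."""
    # Imported lazily so the two helpers above work without pyarrow installed.
    import pyarrow as pa
    import pyarrow.csv as pcsv

    column_types = uniform_column_types(read_header(path), pa.string(), overrides)
    opts = pcsv.ConvertOptions(column_types=column_types)
    return pcsv.read_csv(path, convert_options=opts)
```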
---
## Proposed change
### **Option A – sentinel entry in `column_types`**
Honor a magic key (e.g. `"*"`, `"__default__"`, or a constant
`kWildcardColumn`) inside `ConvertOptions.column_types`.
Lookup order in `MakeConversionSchema()` becomes:
1. exact match in `column_types`
2. sentinel key
3. current fallback (type inference)
### **Option B – new field `default_column_type`**
Add `std::shared_ptr<DataType> default_column_type = nullptr` to
`ConvertOptions`.
If non-null, columns **not** listed in `column_types` are converted with
that type.
Both approaches are backwards-compatible; Option B is explicit and avoids
magic strings, while Option A is a one-line API addition.
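
The intended lookup order for both options can be modelled in a few lines of plain Python. This is only a sketch of the cascade, not the C++ implementation. `resolve_column_type` and `WILDCARD` are hypothetical names, and type names are plain strings to keep the example dependency-free. Returning `None` stands for "fall through to type inference".

```python
WILDCARD = "*"  # hypothetical sentinel key (Option A)


def resolve_column_type(name, column_types, default_column_type=None):
    """Return the type to convert column `name` with, or None to infer it."""
    # 1. An exact match in column_types always wins.
    if name in column_types:
        return column_types[name]
    # 2. Option A: fall back to the sentinel entry, if one is present.
    #    A real column literally named "*" would collide with this key.
    if WILDCARD in column_types:
        return column_types[WILDCARD]
    # 2'. Option B: fall back to the dedicated default field.
    # 3. If that is None as well, the caller keeps today's type inference.
    return default_column_type
```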
---
## Python examples
```python
import pyarrow as pa
import pyarrow.csv as pcsv

# Option A (sentinel)
opts = pcsv.ConvertOptions(column_types={"*": pa.string(), "id": pa.int64()})
tbl = pcsv.read_csv("data.csv", convert_options=opts)

# Option B (explicit field)
opts = pcsv.ConvertOptions(
    default_column_type=pa.string(),  # NEW
    column_types={"id": pa.int64()},  # explicit override
)
tbl = pcsv.read_csv("data.csv", convert_options=opts)
```
### Affected code (C++ path overview)
| Layer | File(s) | Change summary | Notes |
|-------|---------|----------------|-------|
| **Public API** | `cpp/src/arrow/csv/options.h` | • **Add** `std::shared_ptr<DataType> default_column_type;` to `struct ConvertOptions` (Option B) **or** define `static const std::string kWildcardColumn = "__default__";` (Option A).<br>• Document the new knob in the Doxygen comment. | Keeps the setting user-visible. |
| | `cpp/src/arrow/csv/options.cc` | • In `ConvertOptions::Defaults()`, initialise `opts.default_column_type = nullptr;`.<br>• Extend `ConvertOptions::Validate()` to raise `Status::Invalid` for an illegal dtype or duplicate sentinel. | Ensures default behaviour remains unchanged. |
| **Core logic** | `cpp/src/arrow/csv/reader.cc`, inside `MakeConversionSchema()` | Replace the existing two-branch decision with a three-branch cascade:<br> 1. **explicit mapping** →<br> 2. **default_column_type / sentinel** →<br> 3. **infer type** (legacy path). | ~10 LOC patch; confined to one lambda. |
| **Unit tests (C++)** | `cpp/src/arrow/csv/options_test.cc` (new) | Add three cases:<br>• default only: every column gets that type.<br>• default + explicit overrides: explicit wins.<br>• default == nullptr: legacy inference. | Guards against regressions. |
| **Python binding** | `python/pyarrow/_csv.pyx` (Cython) | • **Expose** `default_column_type` keyword (accept `None` or `DataType`).<br>• Map to/from the underlying C++ field. | Maintains PyArrow feature parity. |
| | `python/pyarrow/tests/test_csv.py` | Mirror the three C++ test scenarios. | Confirms binding wiring. |
| **Documentation** | `docs/source/cpp/csv.rst`, `docs/source/python/csv.rst` | Add one bullet and a quick example for the new option. | Makes the feature discoverable. |
| **Other bindings** (optional) | R, GLib, Rust wrappers | Add the field/property if those wrappers already expose `ConvertOptions`. | Can be staged separately. |
> **Build system:** No CMake or Meson tweaks are required—the
dataset/file-CSV paths automatically inherit the updated `ConvertOptions`.
---
### Cross-language bindings checklist
| Language | File / area | Binding note |
|----------|-------------|--------------|
| **Python (pyarrow)** | `_csv.pyx` | add `default_column_type` kwarg with `None` ⇒ `nullptr` |
| **R** (`arrow::r::csv`) | `r/src/` | mirror the field in `convert_options()` constructor |
| **GLib** | `glib/arrow-gio/csv-options.cpp` | expose property `default-column-type` |
| **Rust** | `arrow-csv` crate | add `default_column_type: Option<DataType>` |
| **Java / JNI** | none (CSV reader lives in C++ backend) | no change |
These additions are mechanical once the C++ core is in place.
---
### Component(s)
C++