EeshanBembi opened a new pull request, #17553:
URL: https://github.com/apache/datafusion/pull/17553
## Summary
Enables DataFusion to read directories containing CSV files with
different numbers of columns by implementing schema union during
inference.
Previously, attempting to read multiple CSV files with different
column counts would fail with:
`Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20`
This was particularly problematic for evolving datasets where newer
files include additional columns (e.g., railway services data where
newer files added platform information).
## Changes
- **Enhanced CSV schema inference**: Modified
`infer_schema_from_stream` to build a union schema from all files
instead of rejecting files with different column counts
- **Backward compatible**: Existing behavior is unchanged; the new
behavior requires explicit opt-in via `truncated_rows(true)`
- **Comprehensive testing**: Added unit tests for schema building
logic and integration test with real CSV scenarios
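As a rough illustration of the schema-union idea, here is a minimal, standard-library-only sketch. It is not the code in this PR: `union_columns` is a hypothetical helper, and merging columns by name in first-seen order is an assumption made for the example (the actual inference in `infer_schema_from_stream` may match columns differently and also tracks types).

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of DataFusion): merges the header lists
/// of several CSV files into a single union list of column names.
/// Assumption: columns are matched by name and kept in first-seen order.
fn union_columns(files: &[Vec<&str>]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut union = Vec::new();
    for columns in files {
        for &name in columns {
            // `insert` returns true only the first time a name is seen,
            // so each column appears exactly once in the result.
            if seen.insert(name) {
                union.push(name.to_string());
            }
        }
    }
    union
}
```

With a 3-column header and a 6-column header that extends it, this yields the 6-column list, which matches the outcome described for the integration test.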
## Usage
```rust
// Read CSV directory with mixed column counts
let df = ctx.read_csv(
"path/to/csv/directory/",
CsvReadOptions::new().truncated_rows(true)
).await?;
```
## Test Results
- ✅ All existing tests pass (368/368 DataFusion lib tests)
- ✅ All CSV functionality intact (125/125 CSV tests)
- ✅ New integration test verifies fix with 3-column and 6-column
CSV files
- ✅ Schema inference creates union schema with proper null handling
## Example
Before this fix:
- services_2024.csv: 3 columns → ❌ Error when reading together
- services_2025.csv: 6 columns → ❌ "incorrect number of fields"
After this fix:
- Both files → ✅ Union schema with 6 columns
- Missing columns filled with nulls automatically
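The null-filling step can be pictured with a small sketch like the one below. It uses only the standard library and models a null as `None`; `pad_row` and the column names in the usage note are hypothetical, and matching values to columns by name is an assumption for illustration, not a description of how DataFusion does it internally.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of DataFusion): aligns one row from a
/// narrower file to the union schema. Columns that the file does not
/// have are filled with `None`, standing in for a null value.
fn pad_row(union: &[&str], file_columns: &[&str], row: &[&str]) -> Vec<Option<String>> {
    // Map each column name in this file to the row's value for it.
    let by_name: HashMap<&str, &str> = file_columns
        .iter()
        .copied()
        .zip(row.iter().copied())
        .collect();
    // Walk the union schema in order, looking up each column by name.
    union
        .iter()
        .map(|name| by_name.get(name).map(|value| value.to_string()))
        .collect()
}
```

For instance, padding a row from a file with columns `service_id`, `origin`, `destination` against a union that also contains `platform` gives three `Some` values followed by a `None`.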
Closes #17516