boshek opened a new issue, #46013: URL: https://github.com/apache/arrow/issues/46013
### Describe the bug, including details regarding any error messages, version, and platform.

## Description

When working with partitioned CSV datasets in Arrow, schema specification interacts with type inference in a surprising way, particularly for numeric data. For example, when a floating-point column initially contains only integer-like values (e.g. `15`), Arrow infers the column as an integer type, and the read then fails once it encounters decimal values (e.g. `23.4`). I've tried to put this into a coherent reprex-like issue below.

## Set up data

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(fs)

# Set up example directories
create_dirs <- function() {
  dir_create("partitioned_example")
  dir_create("partitioned_example/month=12/year=1890")
  dir_create("partitioned_example/month=4/year=2011")
}

# Create sample data
create_sample_data <- function() {
  data_1890_12 <- data.frame(
    station_number = c("08MF005", "08MF005", "08MF005"),
    date = as.Date(c("1890-12-01", "1890-12-02", "1890-12-03")),
    value = c(15, 14, 16) # Integer-like values
  )

  data_2011_04 <- data.frame(
    station_number = c("08MF005", "08MF005", "08MF005"),
    date = as.Date(c("2011-04-01", "2011-04-02", "2011-04-03")),
    value = c(23.4, 24.8, 25.1) # Floating-point values
  )

  # Mimic a hive-partitioned structure
  write.csv(data_1890_12, "partitioned_example/month=12/year=1890/part-0.csv", row.names = FALSE)
  write.csv(data_2011_04, "partitioned_example/month=4/year=2011/part-0.csv", row.names = FALSE)
}

setup_example <- function() {
  create_dirs()
  create_sample_data()
}

setup_example()
```

## Problem

### Approach 1: Default behaviour fails due to type inference

```r
## Fails because the first CSV read contains only integer-like numbers
open_csv_dataset("partitioned_example/") |>
  filter(value > 3.2, month == 4) |>
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> !
Invalid: Could not open CSV input source '/.../partitioned_example/month=4/year=2011/part-0.csv':
#> Invalid: In CSV column #2: Row #2: CSV conversion error to int64: invalid value '23.4'
```

### Approach 2: Providing an explicit schema fails with a column-count error

```r
schema <- schema(
  station_number = string(),
  date = date32(),
  value = float64(),
  month = int32(),
  year = int32()
)

## Fails with a column-count error
result <- open_csv_dataset("partitioned_example", schema = schema) |>
  filter(value > 3.2, month == 4) |>
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Could not open CSV input source '/.../partitioned_example/month=4/year=2011/part-0.csv':
#> Invalid: CSV parse error: Row #1: Expected 5 columns, got 3: "station_number","date","value"
```

### Approach 3: Excluding partition columns from the schema

```r
schema <- schema(
  station_number = string(),
  date = date32(),
  value = float64()
)

result <- open_csv_dataset("partitioned_example", schema = schema) |>
  filter(value > 3.2, month == 4) |>
  collect()
#> Error in `month == 4`:
#> ! Expression not supported in Arrow
#> → Call collect() first to pull data into R.
```

### Approach 4: Using hive_partition with an explicit schema

```r
schema <- schema(
  station_number = string(),
  date = date32(),
  value = float64()
)

partitioning <- hive_partition(
  month = int32(),
  year = int32()
)

result <- open_csv_dataset("partitioned_example", schema = schema, partitioning = partitioning) |>
  filter(value > 3.2, month == 4) |>
  collect()
#> Error in `month == 4`:
#> ! Expression not supported in Arrow
#> → Call collect() first to pull data into R.
```

### Approach 5: Working solution, but a bit unintuitive

```r
open_csv_dataset("partitioned_example/", col_types = schema(value = float64())) |>
  filter(value > 3.2, month == 4) |>
  collect()
#> # A tibble: 3 × 5
#>   station_number date       value month  year
#>   <chr>          <date>     <dbl> <int> <int>
#> 1 08MF005        2011-04-01  23.4     4  2011
#> 2 08MF005        2011-04-02  24.8     4  2011
#> 3 08MF005        2011-04-03  25.1     4  2011
```

## Issues

1. The requirement to use `col_types` rather than `schema` for type specification is unintuitive.
2. Users have to reason about the Dataset schema and CSV column type inference at the same time.

## Possible improvements

1. Consider having the CSV reader use the explicit schema by default when one is provided.
2. Improve error messages to suggest `col_types` when schema specification fails.
3. Make behaviour more consistent between `schema` and partitioned columns.
4. Better document how `schema` interacts with partitioning.

## Environment

```r
reprex v2.1.1
R version: 4.4.3
arrow: 19.0.1
```

### Component(s)

R
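To make the root cause and the Approach 5 workaround easier to see in isolation, here is a condensed, self-contained variant of the reprex. The temp directory and the single row per partition are illustrative simplifications, and the `int64` expectation in the comment is taken from the error message in Approach 1 rather than independently verified:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Re-create the reprex data in a temp directory (one row per partition
# is enough to trigger the problem).
base <- file.path(tempdir(), "partitioned_example")
dir.create(file.path(base, "month=12", "year=1890"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(base, "month=4", "year=2011"), recursive = TRUE, showWarnings = FALSE)
write.csv(
  data.frame(station_number = "08MF005", date = as.Date("1890-12-01"), value = 15),
  file.path(base, "month=12", "year=1890", "part-0.csv"), row.names = FALSE
)
write.csv(
  data.frame(station_number = "08MF005", date = as.Date("2011-04-01"), value = 23.4),
  file.path(base, "month=4", "year=2011", "part-0.csv"), row.names = FALSE
)

# Root cause: with defaults, `value`'s type is inferred from the file(s)
# scanned at open time, where every value looks like an integer
# (the Approach 1 error reports a conversion error to int64).
ds_default <- open_csv_dataset(base)
ds_default$schema$GetFieldByName("value")$type

# Approach 5 workaround: pin only the ambiguous column via `col_types`;
# the other columns and the hive partition fields are still inferred.
ds_fixed <- open_csv_dataset(base, col_types = schema(value = float64()))
ds_fixed |>
  filter(value > 3.2, month == 4) |>
  collect()
```

It may also be possible to pass `partitioning = hive_partition(month = int32(), year = int32())` alongside `col_types` to pin the partition-column types as well, though given the Approach 4 behaviour above that combination would need testing.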