boshek opened a new issue, #46013:
URL: https://github.com/apache/arrow/issues/46013

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   ## Description
   
   When working with partitioned CSV datasets in Arrow, there is a confusing 
interaction between schema specification and CSV type inference, particularly 
with numeric data. For example, if a floating-point column contains only 
integer-like values in the first file read (e.g., `15`), Arrow infers the 
column as an integer type, and reading then fails when a later file contains 
decimal values (e.g., `23.4`). I've tried to put this into a coherent 
reprex-like issue below.
   
   ## Set up data
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   library(fs)
   
   # Setup example directories and data
   create_dirs <- function() {
    dir_create("partitioned_example")
    dir_create("partitioned_example/month=12/year=1890")
    dir_create("partitioned_example/month=4/year=2011")
   }
   
   # Create sample data
   create_sample_data <- function() {
    data_1890_12 <- data.frame(
      station_number = c("08MF005", "08MF005", "08MF005"),
      date = as.Date(c("1890-12-01", "1890-12-02", "1890-12-03")),
      value = c(15, 14, 16)  # Integer-like values
    )
    data_2011_04 <- data.frame(
      station_number = c("08MF005", "08MF005", "08MF005"),
      date = as.Date(c("2011-04-01", "2011-04-02", "2011-04-03")),
      value = c(23.4, 24.8, 25.1)  # Floating point values
    )
    # mimic partitioned structure
    write.csv(data_1890_12, 
              "partitioned_example/month=12/year=1890/part-0.csv", 
              row.names = FALSE)
    
    write.csv(data_2011_04, 
              "partitioned_example/month=4/year=2011/part-0.csv", 
              row.names = FALSE)
   }
   
   setup_example <- function() {
    create_dirs()
    create_sample_data()
   }
   
   setup_example()
   ```
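   
   A quick peek at what `write.csv` actually produces helps explain the failure 
below: base R serialises whole-number doubles without a decimal point, so the 
1890 file stores `value` as bare integers, which is all Arrow's CSV inference 
has to go on. (A minimal self-contained sketch using a temp file:)
   
   ```r
   # write.csv() drops the decimal point for whole-number doubles, so a
   # column like c(15, 14, 16) is indistinguishable from integers in the
   # resulting CSV text.
   tmp <- tempfile(fileext = ".csv")
   write.csv(data.frame(value = c(15, 14, 16)), tmp, row.names = FALSE)
   readLines(tmp)  # rows are written as "15", "14", "16" -- no decimals
   ```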
   
   ## Problem
   
   ### Approach 1: Default behaviour fails due to type inference
   
   ```r
   ## fails because the first CSV read contains only integer-like numbers,
   ## so `value` is inferred as int64
   open_csv_dataset("partitioned_example/") |> 
    filter(value > 3.2, month == 4) |> 
    collect()
   #> Error in `compute.arrow_dplyr_query()`:
   #> ! Invalid: Could not open CSV input source 
'/.../partitioned_example/month=4/year=2011/part-0.csv': 
   #> Invalid: In CSV column #2: Row #2: CSV conversion error to int64: invalid 
value '23.4'
   ```
   
   ### Approach 2: Providing explicit schema fails with column count error
   
   ```r
   schema <- schema(
    station_number = string(),
    date = date32(),
    value = float64(),
    month = int32(),
    year = int32()
   )
   
   ## fails with a column-count error: the schema includes the partition
   ## columns, which are not present in the files themselves
   result <- open_csv_dataset("partitioned_example", schema = schema) |>
    filter(value > 3.2, month == 4) |> 
    collect()
   #> Error in `compute.arrow_dplyr_query()`:
   #> ! Invalid: Could not open CSV input source 
'/.../partitioned_example/month=4/year=2011/part-0.csv': 
   #> Invalid: CSV parse error: Row #1: Expected 5 columns, got 3: 
"station_number","date","value"
   ```
   
   ### Approach 3: Excluding partition columns from schema
   
   ```r
   schema <- schema(
    station_number = string(),
    date = date32(),
    value = float64()
   )
   
   result <- open_csv_dataset("partitioned_example", schema = schema) |>
    filter(value > 3.2, month == 4) |> 
    collect()
   #> Error in `month == 4`:
   #> ! Expression not supported in Arrow
   #> → Call collect() first to pull data into R.
   ```
   
   ### Approach 4: Using hive_partition with explicit schema
   
   ```r
   schema <- schema(
    station_number = string(),
    date = date32(),
    value = float64()
   )
   
   partitioning <- hive_partition(
    month = int32(),
    year = int32()
   )
   
   result <- open_csv_dataset("partitioned_example", schema = schema, 
partitioning = partitioning) |>
    filter(value > 3.2, month == 4) |> 
    collect()
   #> Error in `month == 4`:
   #> ! Expression not supported in Arrow
   #> → Call collect() first to pull data into R.
   ```
   
   ### Approach 5: Working solution, but a bit unintuitive
   
   ```r
   open_csv_dataset("partitioned_example/", col_types = schema(value = 
float64())) |> 
    filter(value > 3.2, month == 4) |> 
    collect()
   #> # A tibble: 3 × 5
   #>   station_number date       value month  year
   #>   <chr>          <date>     <dbl> <int> <int>
   #> 1 08MF005        2011-04-01  23.4     4  2011
   #> 2 08MF005        2011-04-02  24.8     4  2011
   #> 3 08MF005        2011-04-03  25.1     4  2011
   ```
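   
   For completeness (and assuming the setup chunk above has been run): 
supplying the types of *all* file columns via `col_types`, rather than just 
the problematic one, presumably works the same way, since `col_types` feeds 
the CSV conversion directly rather than the Dataset schema:
   
   ```r
   # Sketch: full file-column types via col_types (partition columns
   # month/year are still discovered from the hive-style paths).
   open_csv_dataset(
     "partitioned_example/",
     col_types = schema(
       station_number = string(),
       date = date32(),
       value = float64()
     )
   ) |>
     filter(value > 3.2, month == 4) |>
     collect()
   ```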
   
   ## Issues
   
   1. The requirement to use `col_types` rather than `schema` for type 
specification is unintuitive: `schema` looks like the natural place to pin 
down column types.
   2. Users have to reason about the Dataset schema and CSV column type 
inference simultaneously.
   
   
   ## Possible improvements
   
   1. Consider having the CSV reader use the explicit `schema` for conversion 
by default when one is provided.
   2. Improve the error messages to suggest `col_types` when conversion fails 
under an explicit schema.
   3. Make the behaviour of `schema` with partition columns more consistent.
   4. Improve the documentation of how `schema` interacts with partitioning.
   
   ## Environment
   
   ```r
   reprex v2.1.1
   R version: 4.4.3
   arrow: 19.0.1
   ```
   
   ### Component(s)
   
   R

