thisisnic opened a new issue, #37902:
URL: https://github.com/apache/arrow/issues/37902

   ### Describe the enhancement requested
   
   As described in 
https://github.com/apache/arrow/issues/14907#issuecomment-1735674700.  Reprex 
copied below.
   
   ```r
   library(dplyr)
   library(arrow)
   
   tbl1 <- tibble::tibble(
     a = 1:3,
     b = c("a", "b", NA),
     d = c(letters[4:6])
   )
   
   tbl2 <- tibble::tibble(
     b = c("b", NA),
     c = c("a should be 2", "a should be 3")
   )
   
   
   # Left join tibbles, NAs matched
   left_join(tbl2, tbl1)
   #> Joining with `by = join_by(b)`
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3     3 f
   
   
   # Left join arrow table & tibble, NAs NOT matched
   left_join(as_arrow_table(tbl2), tbl1) %>% collect()
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3    NA <NA>
   
   
   # Left join arrow table & arrow table, NAs NOT matched
   left_join(as_arrow_table(tbl2), as_arrow_table(tbl1)) %>% collect()
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3    NA <NA>
   ```
   
   The problem here is the question of "can we join on `NA` values in arrow?"
   
   Not right now!  But, here are a couple of workarounds.  The first uses extra 
code, and the second passes the data to duckdb and back.
   
   
   ``` r
   library(arrow)
   library(dplyr)
   
   tbl1 <- tibble::tibble(
     a = 1:3,
     b = c("a", "b", NA),
     d = c(letters[4:6])
   )
   
   tbl2 <- tibble::tibble(
     b = c("b", NA),
     c = c("a should be 2", "a should be 3")
   )
   
   as_arrow_table(tbl2) |>
     # replace NAs in tbl2 with alternative value
     mutate(b = ifelse(is.na(b), "temp_value", b)) |>
     left_join(
       as_arrow_table(tbl1) |>
         # replace NAs in tbl1 with alternative value
         mutate(b = ifelse(is.na(b), "temp_value", b))
     ) |>
     # replace alternative value in results with NA
     mutate(b = ifelse(b == "temp_value", NA, b)) |>
     collect()
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3     3 f
   ```
   
   Or with DuckDB:
   
   ```
   tbl1_duckdb <- as_arrow_table(tbl1) |>
     to_duckdb()
   
   as_arrow_table(tbl2) |>
     to_duckdb() |>
     left_join(tbl1_duckdb, na_matches = "na") |>
     collect()
   #> Joining with `by = join_by(b)`
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3     3 f
   ```
   
   Note that we have to pass in `na_matches = "na"` explicitly in the example 
there as the default value when working with duckdb/dbplyr is "never" - 
basically reflecting that in SQL we can't join on NULL (NA) values.
   
   I also had a look to try to work out whether we can implement this in Arrow 
or not.  dbplyr implements a function `sql_expr_matches()` which is what allows 
matching on NAs, and here's a snippet from it:
   
   ```
   sql_expr_matches.DBIConnection <- function(con, x, y, ...) {
     glue_sql2(
       con,
       "CASE WHEN ({x} = {y}) OR ({x} IS NULL AND {y} IS NULL) ",
       "THEN 0 ",
       "ELSE 1 ",
       "END = 0"
     )
   }
   ```
   
   The C++ changes in #11579 *may* allow us to implement something like this in 
arrow - the unit tests in that PR certainly make it look feasible, though 
non-trivial.
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to