TPDeramus opened a new issue, #43747:
URL: https://github.com/apache/arrow/issues/43747

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Hi Arrow Devs.
   
   Not sure if this would be more of a usage question or a bug, but it 
certainly seemed to behave like the latter.
   
   I have a large amount of data that needs to be queried in a specific way 
using a combination of hard (exact matches) and soft (begins with) string 
matches based on another condition (e.g. value of "class" in another column).
   
   I was recently able to get this working via `if_else` with many other 
singular instances where the queries were smaller, but with this larger dataset 
I need to recruit `case_when` for a set of larger, conditional mutations based 
on which "database" the query belongs to.
   
   However, every time I do this, the package seems to throw an error that it 
cannot be run in `arrow` and needs to be pulled into R:
   ```
   FilteredOutput <- LargeDataset |> 
   select(ID, BeginDate, EndDate, class, identity_string, database) |> 
   distinct() |> 
   mutate(Foundate = as.Date("1700-01-01"), Logdate = case_when((((class == "A" 
& arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), 
skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | 
(class == "B" & arrow_is_in( identity_string, options = list(value_set = 
Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & 
arrow_is_in(identity_string, options = list(value_set = 
Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary") ~ EndDate, 
database == "primary" ~ as.Date(as.integer(BeginDate) + 30), .default = 
Foundate))
   
   Error: Expression case_when((((arrow_is_in(identity_string, options = 
list(value_set = Array$create(HardA), ... not supported in Arrow
   Call collect() first to pull data into R
   ```
   
   I assumed this was some kind of issue with doing a date conversion in 
`arrow`, which I realize does not have hard-mapped `tidyverse` commands for 
doing mathematical operations on dates (e.g. I need to add 30 days to BeginDate 
and the like, hence why I used Foundate as an imputed placeholder that would 
not be possible to remove later.
   
   However, oddly enough, every single one of these other options seem to work:
   
   ```
   #Works (adding to date permitted, no case_when)
   FilteredOutput <- LargeDataset |> 
   select(ID, BeginDate, EndDate, class, identity_string, database) |> 
   distinct() |> 
   mutate(Logdate = if_else(database == "primary", 
as.Date(as.integer(BeginDate) + 30), EndDate)) |> as_arrow_table()
   
   #Works (no case_when, no math performed on date)
   FilteredOutput <- LargeDataset |> 
   select(ID, BeginDate, EndDate, class, identity_string, database) |> 
   distinct() |> 
   mutate(Logdate = if_else((((class == "A" & 
arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), 
skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | 
(class == "B" & arrow_is_in(identity_string, options = list(value_set = 
Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & 
arrow_is_in(identity_string, options = list(value_set = 
Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary"), EndDate, 
NA)) |> as_arrow_table()
   
   #Works (no case when, adding to date, permits operations on date)
   FilteredOutput <- LargeDataset |> 
   select(ID, BeginDate, EndDate, class, identity_string, database) |> 
   distinct() |> 
   mutate(Foundate = as.Date("1700-01-01"), Logdate = if_else(database == 
"primary", as.Date(as.integer(BeginDate) + 30), Foundate)) |> as_arrow_table()
   
   
   #This Works(HAS case_when, but value for met criteria is an integer, not a 
date)
   > FilteredOutput <- LargeDataset |> 
   select(ID, BeginDate, EndDate, class, identity_string, database) |> 
   distinct() |> 
   mutate(Foundate = as.Date("1700-01-01"), Logdate = case_when((((class == "A" 
& arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), 
skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | 
(class == "B" & arrow_is_in( identity_string, options = list(value_set = 
Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & 
arrow_is_in(identity_string, options = list(value_set = 
Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary") ~ 1, 
database == "primary" ~ 0, .default = 2)) |> as_arrow_table()
   
   
   #This Works (no case_when, saves as date with no math)
   FilteredOutput <- LargeDataset |> 
   select(ID, BeginDate, EndDate, class, identity_string, database) |> 
   distinct() |> 
   mutate(Foundate = as.Date("1700-01-01"), Logdate = if_else((((class == "A" & 
arrow_is_in(identity_string,options = list(value_set = Array$create(HardA), 
skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) | 
(class == "B" & arrow_is_in(identity_string, options = list(value_set = 
Array$create(HardB),skip_nulls = TRUE))) | (class == "C" & 
arrow_is_in(identity_string, options = list(value_set = 
Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary"), EndDate, 
Foundate)) |> as_arrow_table()
   
   ```
   
   Am I facing some sort of syntax issue or is there a limitation when 
attempting to set `case_when` conditions with regards to dates?
   
   Thank you in advance for any suggestions.
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to