TPDeramus opened a new issue, #43747:
URL: https://github.com/apache/arrow/issues/43747
### Describe the bug, including details regarding any error messages,
version, and platform.
Hi Arrow Devs.
Not sure if this would be more of a usage question or a bug, but it
certainly seemed to behave like the latter.
I have a large amount of data that needs to be queried in a specific way
using a combination of hard (exact matches) and soft (begins with) string
matches based on another condition (e.g. value of "class" in another column).
I was recently able to get this working via `if_else` with many other
singular instances where the queries were smaller, but with this larger dataset
I need to recruit `case_when` for a set of larger, conditional mutations based
on which "database" the query belongs to.
However, every time I do this, the package seems to throw an error that it
cannot be run in `arrow` and needs to be pulled into R:
```
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = case_when((((class == "A"
& arrow_is_in(identity_string,options = list(value_set = Array$create(HardA),
skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) |
(class == "B" & arrow_is_in( identity_string, options = list(value_set =
Array$create(HardB),skip_nulls = TRUE))) | (class == "C" &
arrow_is_in(identity_string, options = list(value_set =
Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary") ~ EndDate,
database == "primary" ~ as.Date(as.integer(BeginDate) + 30), .default =
Foundate))
Error: Expression case_when((((arrow_is_in(identity_string, options =
list(value_set = Array$create(HardA), ... not supported in Arrow
Call collect() first to pull data into R
```
I assumed this was some kind of issue with doing a date conversion in
`arrow`, which I realize does not have hard-mapped `tidyverse` commands for
doing mathematical operations on dates (e.g. I need to add 30 days to BeginDate
and the like, hence why I used Foundate as an imputed placeholder that would
not be possible to remove later.
However, oddly enough, every single one of these other options seem to work:
```
#Works (adding to date permitted, no case_when)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Logdate = if_else(database == "primary",
as.Date(as.integer(BeginDate) + 30), EndDate)) |> as_arrow_table()
#Works (no case_when, no math performed on date)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Logdate = if_else((((class == "A" &
arrow_is_in(identity_string,options = list(value_set = Array$create(HardA),
skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) |
(class == "B" & arrow_is_in(identity_string, options = list(value_set =
Array$create(HardB),skip_nulls = TRUE))) | (class == "C" &
arrow_is_in(identity_string, options = list(value_set =
Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary"), EndDate,
NA)) |> as_arrow_table()
#Works (no case when, adding to date, permits operations on date)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = if_else(database ==
"primary", as.Date(as.integer(BeginDate) + 30), Foundate)) |> as_arrow_table()
#This Works(HAS case_when, but value for met criteria is an integer, not a
date)
> FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = case_when((((class == "A"
& arrow_is_in(identity_string,options = list(value_set = Array$create(HardA),
skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) |
(class == "B" & arrow_is_in( identity_string, options = list(value_set =
Array$create(HardB),skip_nulls = TRUE))) | (class == "C" &
arrow_is_in(identity_string, options = list(value_set =
Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary") ~ 1,
database == "primary" ~ 0, .default = 2)) |> as_arrow_table()
#This Works (no case_when, saves as date with no math)
FilteredOutput <- LargeDataset |>
select(ID, BeginDate, EndDate, class, identity_string, database) |>
distinct() |>
mutate(Foundate = as.Date("1700-01-01"), Logdate = if_else((((class == "A" &
arrow_is_in(identity_string,options = list(value_set = Array$create(HardA),
skip_nulls = TRUE))) | (class == "A" & str_starts(identity_string, WildA)) |
(class == "B" & arrow_is_in(identity_string, options = list(value_set =
Array$create(HardB),skip_nulls = TRUE))) | (class == "C" &
arrow_is_in(identity_string, options = list(value_set =
Array$create(CHard),skip_nulls = TRUE)))) & database == "secondary"), EndDate,
Foundate)) |> as_arrow_table()
```
Am I facing some sort of syntax issue or is there a limitation when
attempting to set `case_when` conditions with regards to dates?
Thank you in advance for any suggestions.
### Component(s)
R
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]