dragosmg opened a new issue, #49836:
URL: https://github.com/apache/arrow/issues/49836
### Describe the bug, including details regarding any error messages, version, and platform.
I am not 100% sure whether this is a bug or just unexpected behaviour.
When an R `Date` vector contains `Inf` or `-Inf`, converting to an Arrow
table (e.g. via `as_arrow_table()` or `write_parquet()`) silently converts
these to extreme, but finite, dates instead of `null` or raising an error.
**Reprex**:
``` r
library(arrow)
library(tibble)
library(dplyr)
tibble(x = as.Date(c(Inf, -Inf))) |>
as_arrow_table() |>
collect()
#> # A tibble: 2 × 1
#> x
#> <date>
#> 1 5881580-07-11
#> 2 -5877641-06-23
```
Second example (includes `NaN`):
``` r
library(arrow)
library(tibble)
library(dplyr)
chunks <- tibble(x = as.Date(c(Inf, -Inf, NaN))) |>
as_arrow_table()
chunks$columns
#> [[1]]
#> ChunkedArray
#> <date32[day]>
#> [
#> [
#> <value out of range: 2147483647>,
#> <value out of range: -2147483648>,
#> 1970-01-01
#> ]
#> ]
```
A final reprex highlights a further problem when casting to `int32`: R
represents `NA_integer_` as `INT_MIN`, so the `-Inf` date comes back as `NA`:
``` r
library(arrow)
library(tibble)
library(dplyr)
tibble(x = as.Date(c(Inf, -Inf, NaN))) |>
as_arrow_table() |>
collect() |>
mutate(y = as.integer(x))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `y = as.integer(x)`.
#> Caused by warning:
#> ! NAs introduced by coercion to integer range
#> # A tibble: 3 × 2
#> x y
#> <date> <int>
#> 1 5881580-07-11 2147483647
#> 2 -5877641-06-23 NA
#> 3 1970-01-01 0
```
**Expected behaviour**:
I think it would be better if `Inf`/`-Inf` dates were either converted to
`null` in the Arrow array or caused the conversion to error or warn, indicating
that infinite date values are not representable in `date32`.
**Root cause**:
In https://github.com/apache/arrow/blob/main/r/src/r_to_arrow.cpp#L600,
`FromRDate` for `Date32Type` does:
```
static int FromRDate(const Date32Type*, double from) {
  return static_cast<int>(std::floor(from));
}
```
As far as I understand, `static_cast<int>(std::floor(Inf))` is undefined
behaviour in C++, because the value is outside the range of `int`. In the
reprexes above it produced `INT_MAX`/`INT_MIN`, which Arrow then interprets as
concrete dates roughly 5.88 million years from the epoch.
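As a rough sanity check on those extreme dates, the sketch below converts the `int32` limits from a day count to an approximate year. The helper name `ApproxYearFromEpochDays` and the use of the mean Gregorian year length are assumptions made for illustration; this is not how Arrow formats dates.

```cpp
#include <cstdint>
#include <limits>

// Mean length of a Gregorian year in days (97 leap years every 400 years).
constexpr double kDaysPerGregorianYear = 365.2425;

// Hypothetical helper: approximate year reached by counting `days` from
// 1970-01-01. An order-of-magnitude check only, not a calendar conversion.
constexpr double ApproxYearFromEpochDays(std::int32_t days) {
  return 1970.0 + static_cast<double>(days) / kDaysPerGregorianYear;
}

// Approximate years corresponding to the largest and smallest date32 values.
constexpr double kApproxMaxYear =
    ApproxYearFromEpochDays(std::numeric_limits<std::int32_t>::max());
constexpr double kApproxMinYear =
    ApproxYearFromEpochDays(std::numeric_limits<std::int32_t>::min());
```

Under these assumptions `kApproxMaxYear` falls within year 5881580 and `kApproxMinYear` within year -5877641, which is consistent with the dates printed in the first reprex.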
A possible fix would be to check for non-finite values before the cast:
```
static int FromRDate(const Date32Type*, double from) {
  // std::isfinite (from <cmath>) is false for Inf, -Inf and NaN
  if (!std::isfinite(from)) {
    // handle as null or error/warn
  }
  return static_cast<int>(std::floor(from));
}
```
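To make the idea concrete outside of Arrow, here is a minimal standalone sketch of a checked conversion. The name `CheckedDate32FromRDate` and the `std::optional` return type are hypothetical and are not part of the Arrow codebase; the sketch only shows that non-finite and out-of-range inputs can be detected before the cast, leaving the caller to decide between a null, a warning, or an error.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper, not part of Arrow: converts an R Date payload (days
// since 1970-01-01, stored as a double) to a date32 day count.
// Returns std::nullopt when the value cannot be represented as int32.
inline std::optional<std::int32_t> CheckedDate32FromRDate(double from) {
  // Inf, -Inf and NaN all fail this test.
  if (!std::isfinite(from)) {
    return std::nullopt;
  }
  const double days = std::floor(from);
  // Both int32 bounds are exactly representable as doubles, so comparing
  // in double precision before the cast is exact.
  constexpr double kMin =
      static_cast<double>(std::numeric_limits<std::int32_t>::min());
  constexpr double kMax =
      static_cast<double>(std::numeric_limits<std::int32_t>::max());
  if (days < kMin || days > kMax) {
    return std::nullopt;
  }
  // The cast is now well defined because `days` is inside the int32 range.
  return static_cast<std::int32_t>(days);
}
```

The range check also covers finite doubles that are too large for `int32`, which would hit the same undefined cast. Note that R's `NA_real_` is itself a NaN, so a real fix would need to decide how this check interacts with whatever NA handling already happens before `FromRDate` is called; that part is not shown here.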
`NaN` dates are affected by the same behaviour. This is arguably more
problematic, as the round trip turns them into `0` (i.e. the epoch, 1970-01-01).
### Component(s)
R