blongworth opened a new issue, #45139:
URL: https://github.com/apache/arrow/issues/45139
### Describe the bug, including details regarding any error messages,
version, and platform.
slice_sample() in Arrow 18.1.0 does not randomly sample rows from an arrow
dataset. Reprex follows. I'd expect a random sample of rows, so x distributed
in 1:5000.
``` r
library(dplyr)
library(arrow)
df <- data.frame(x = 1:5000)
ds <- arrow_table(df)
slice_sample(ds, n = 10) |>
collect()
#> # A tibble: 10 × 1
#> x
#> <int>
#> 1 3
#> 2 4
#> 3 16
#> 4 17
#> 5 65
#> 6 66
#> 7 74
#> 8 93
#> 9 123
#> 10 129
```
<sup>Created on 2024-12-31 with [reprex
v2.1.1](https://reprex.tidyverse.org)</sup>
<details style="margin-bottom:10px;">
<summary>
Session info
</summary>
``` r
sessioninfo::session_info()
#> ─ Session info
───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.2 (2024-10-31)
#> os macOS Sonoma 14.7
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2024-12-31
#> pandoc 3.2 @
/Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via
rmarkdown)
#>
#> ─ Packages
───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> arrow * 18.1.0 2024-12-05 [1] CRAN (R 4.4.1)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.4.0)
#> bit 4.0.5 2022-11-15 [1] CRAN (R 4.4.0)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.4.0)
#> cli 3.6.2 2023-12-11 [1] CRAN (R 4.4.0)
#> digest 0.6.35 2024-03-11 [1] CRAN (R 4.4.0)
#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
#> evaluate 0.23 2023-11-01 [1] CRAN (R 4.4.0)
#> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.4.0)
#> fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
#> glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
#> htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#> knitr 1.46 2024-04-06 [1] CRAN (R 4.4.0)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
#> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
#> reprex 2.1.1 2024-07-06 [1] CRAN (R 4.4.0)
#> rlang 1.1.3 2024-01-10 [1] CRAN (R 4.4.0)
#> rmarkdown 2.26 2024-03-05 [1] CRAN (R 4.4.0)
#> rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
#> tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
#> withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
#> xfun 0.49 2024-10-31 [1] CRAN (R 4.4.1)
#> yaml 2.3.8 2023-12-11 [1] CRAN (R 4.4.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
#>
#>
──────────────────────────────────────────────────────────────────────────────
```
</details>
As this issue could be dangerous for someone assuming a random sample,
should there be a note in the docs or slice_sample() removed until it's fixed?
### Component(s)
R
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]