djvanderlaan opened a new issue, #46428: URL: https://github.com/apache/arrow/issues/46428
### Describe the bug, including details regarding any error messages, version, and platform.

I noticed that some operations are substantially slower and use more memory under arrow v20.0.0 and v19.0.0 than under v17.0.0. I managed to reduce the example and am able to reproduce this both on a production machine running Ubuntu 22.04 and on my home desktop (running Debian stable).

The run of the example on my desktop with v17 took about 10s and used a maximum of approx 7GB memory. The v20 run was killed after 1m16s because it ran out of memory (my home machine is unfortunately limited to 24GB). Before being killed the memory use peaked at approx 22GB. See below for the output.

The following code generates the example data:

```r
nvert <- 10E6
nedge <- 20E7
vert <- data.frame(
  id = seq_len(nvert)
)
edges <- data.frame(
  src = sample(nvert, nedge, replace = TRUE),
  dst = sample(nvert, nedge, replace = TRUE),
  type = sample(1:10, nedge, replace = TRUE)
)
library(arrow)
write_parquet(vert, "vertices.parquet")
write_parquet(edges, "edges.parquet")
```

The following script processes this data and shows substantial differences between v20 and v17 (the production machine had v19, which showed the same behaviour as v20).

```r
library(arrow)
library(dplyr)

sessionInfo()

vert <- read_parquet("vertices.parquet")
str(vert)

con <- open_dataset("edges.parquet")
con

dta <- con |> filter(src %in% vert$id, dst %in% vert$id) |> collect()

nrow(dta)
```

Below are the results for v17:

```
$ /usr/bin/time -v R --no-save < test.R

R version 4.5.0 (2025-04-11) -- "How About a Twenty-Six"
Copyright (C) 2025 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
>
> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

>
> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Debian GNU/Linux 12 (bookworm)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0 LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Amsterdam
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_1.1.4  arrow_17.0.0

loaded via a namespace (and not attached):
 [1] assertthat_0.2.1 R6_2.6.1         bit_4.6.0        tidyselect_1.2.1
 [5] magrittr_2.0.3   glue_1.8.0       tibble_3.2.1     pkgconfig_2.0.3
 [9] bit64_4.6.0-1    generics_0.1.3   lifecycle_1.0.4  cli_3.6.5
[13] vctrs_0.6.5      compiler_4.5.0   purrr_1.0.4      pillar_1.10.2
[17] rlang_1.1.6
>
> vert <- read_parquet("vertices.parquet")
> str(vert)
tibble [10,000,000 × 1] (S3: tbl_df/tbl/data.frame)
 $ id: int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
>
> con <- open_dataset("edges.parquet")
> con
FileSystemDataset with 1 Parquet file
3 columns
src: int32
dst: int32
type: int32

See $metadata for additional Schema metadata
>
> dta <- con |> filter(src %in% vert$id, dst %in% vert$id) |> collect()
>
> nrow(dta)
[1] 200000000
>
>
	Command being timed: "R --no-save"
	User time (seconds): 30.19
	System time (seconds): 2.74
	Percent of CPU this job got: 306%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.75
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 7078692
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 443
	Minor (reclaiming a frame) page faults: 61785
	Voluntary context switches: 10143
	Involuntary context switches: 10536
	Swaps: 0
	File system inputs: 3898808
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
```

Below are the results for v20:

```
$ /usr/bin/time -v R --no-save < test.R

R version 4.5.0 (2025-04-11) -- "How About a Twenty-Six"
Copyright (C) 2025 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
>
> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

>
> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Debian GNU/Linux 12 (bookworm)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0 LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Amsterdam
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_1.1.4  arrow_20.0.0

loaded via a namespace (and not attached):
 [1] assertthat_0.2.1 R6_2.6.1         bit_4.6.0        tidyselect_1.2.1
 [5] magrittr_2.0.3   glue_1.8.0       tibble_3.2.1     pkgconfig_2.0.3
 [9] bit64_4.6.0-1    generics_0.1.3   lifecycle_1.0.4  cli_3.6.5
[13] vctrs_0.6.5      compiler_4.5.0   purrr_1.0.4      pillar_1.10.2
[17] rlang_1.1.6
>
> vert <- read_parquet("vertices.parquet")
> str(vert)
tibble [10,000,000 × 1] (S3: tbl_df/tbl/data.frame)
 $ id: int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
>
> con <- open_dataset("edges.parquet")
> con
FileSystemDataset with 1 Parquet file
3 columns
src: int32
dst: int32
type: int32

See $metadata for additional Schema metadata
>
> dta <- con |> filter(src %in% vert$id, dst %in% vert$id) |> collect()
Command terminated by signal 9
	Command being timed: "R --no-save"
	User time (seconds): 50.57
	System time (seconds): 8.59
	Percent of CPU this job got: 78%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:15.57
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 21869392
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 622
	Minor (reclaiming a frame) page faults: 2185479
	Voluntary context switches: 2431
	Involuntary context switches: 1599
	Swaps: 0
	File system inputs: 140744
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
```

So the run with v17 took 10s and a maximum of approx 7GB memory. The v20 run was killed after 1m16s because it ran out of memory (my home machine is unfortunately limited to 24GB). Before being killed the memory use peaked at approx 22GB.

### Component(s)

R
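
For anyone comparing runs across arrow versions, the two figures quoted in this report (wall-clock time and peak memory) can be pulled out of a saved `time -v` report with a small helper. This is only a sketch and not part of the original report: the function name `summarise_time_report` and the report file names are hypothetical, and it assumes the `kbytes` figure printed by GNU `time` is in units of 1024 bytes.

```shell
#!/bin/sh
# Sketch: summarise a GNU `time -v` report saved to a file.
# A report can be saved with GNU time's -o option, for example:
#   /usr/bin/time -v -o v17.time R --no-save < test.R
# (v17.time is a hypothetical file name.)

summarise_time_report() {
  # $1: path to a file containing the output of `/usr/bin/time -v`
  awk '
    # The last field of the "Elapsed" line is the wall-clock time.
    /Elapsed/ && /wall clock/ { elapsed = $NF }
    # The last field of the "Maximum resident set size" line is the peak RSS.
    /Maximum resident set size/ { kb = $NF }
    # Convert the peak RSS to GiB, assuming 1 kbyte = 1024 bytes.
    END { printf "elapsed=%s peak=%.2f GiB\n", elapsed, kb / 1024 / 1024 }
  ' "$1"
}
```

Calling `summarise_time_report v17.time` and `summarise_time_report v20.time` on reports saved from each installation gives one line per run that can be compared directly.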