djvanderlaan opened a new issue, #46428:
URL: https://github.com/apache/arrow/issues/46428

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I noticed that some operations are substantially slower and use more memory 
under arrow v20.0.0 and v19.0.0 than under v17.0.0.  I managed to reduce the 
problem to a small example and can reproduce it both on a production machine 
running Ubuntu 22.04 and on my home desktop (running Debian stable). 
   
   The run of the example on my desktop with v17 took 10s and a maximum of 
approx 7GB memory.  The v20 run was killed after 1m16s because it ran out of 
memory (my home machine is unfortunately limited to 24GB). Before being killed 
the memory use peaked at approx 22GB. See below for the output. 
   
   
   The following code generates the example data:
   ```r
   nvert <- 10E6
   nedge <- 20E7
   
   vert <- data.frame(
       id = seq_len(nvert)
     )
   
   edges <- data.frame(
       src = sample(nvert, nedge, replace = TRUE),
       dst = sample(nvert, nedge, replace = TRUE),
       type = sample(1:10, nedge, replace = TRUE)
     )
   
   library(arrow)
   write_parquet(vert, "vertices.parquet")
   write_parquet(edges, "edges.parquet")
   ```
   
   The following script processes this data and shows substantial differences 
between v20 and v17 (the production machine had v19, which showed the same 
behaviour as v20). 
   
   ```r
   
   library(arrow)
   library(dplyr)
   
   sessionInfo()
   
   vert <- read_parquet("vertices.parquet")
   str(vert)
   
   con <- open_dataset("edges.parquet")
   con
   
   dta <- con |> filter(src %in% vert$id, dst %in% vert$id) |> collect()
   
   nrow(dta)
   ```
   
   Below the results for v17:
   
   ```
   $ /usr/bin/time -v R --no-save < test.R
   
   R version 4.5.0 (2025-04-11) -- "How About a Twenty-Six"
   Copyright (C) 2025 The R Foundation for Statistical Computing
   Platform: x86_64-pc-linux-gnu
   
   R is free software and comes with ABSOLUTELY NO WARRANTY.
   You are welcome to redistribute it under certain conditions.
   Type 'license()' or 'licence()' for distribution details.
   
     Natural language support but running in an English locale
   
   R is a collaborative project with many contributors.
   Type 'contributors()' for more information and
   'citation()' on how to cite R or R packages in publications.
   
   Type 'demo()' for some demos, 'help()' for on-line help, or
   'help.start()' for an HTML browser interface to help.
   Type 'q()' to quit R.
   
   > 
   > 
   > library(arrow)
   
   Attaching package: ‘arrow’
   
   The following object is masked from ‘package:utils’:
   
       timestamp
   
   > library(dplyr)
   
   Attaching package: ‘dplyr’
   
   The following objects are masked from ‘package:stats’:
   
       filter, lag
   
   The following objects are masked from ‘package:base’:
   
       intersect, setdiff, setequal, union
   
   > 
   > sessionInfo()
   R version 4.5.0 (2025-04-11)
   Platform: x86_64-pc-linux-gnu
   Running under: Debian GNU/Linux 12 (bookworm)
   
   Matrix products: default
   BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
   LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0  LAPACK version 
3.11.0
   
   locale:
    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
    [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8    
    [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8   
    [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
    [9] LC_ADDRESS=C               LC_TELEPHONE=C            
   [11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       
   
   time zone: Europe/Amsterdam
   tzcode source: system (glibc)
   
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base     
   
   other attached packages:
   [1] dplyr_1.1.4  arrow_17.0.0
   
   loaded via a namespace (and not attached):
    [1] assertthat_0.2.1 R6_2.6.1         bit_4.6.0        tidyselect_1.2.1
    [5] magrittr_2.0.3   glue_1.8.0       tibble_3.2.1     pkgconfig_2.0.3 
    [9] bit64_4.6.0-1    generics_0.1.3   lifecycle_1.0.4  cli_3.6.5       
   [13] vctrs_0.6.5      compiler_4.5.0   purrr_1.0.4      pillar_1.10.2   
   [17] rlang_1.1.6     
   > 
   > vert <- read_parquet("vertices.parquet")
   > str(vert)
   tibble [10,000,000 × 1] (S3: tbl_df/tbl/data.frame)
    $ id: int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
   > 
   > con <- open_dataset("edges.parquet")
   > con
   FileSystemDataset with 1 Parquet file
   3 columns
   src: int32
   dst: int32
   type: int32
   
   See $metadata for additional Schema metadata
   > 
   > dta <- con |> filter(src %in% vert$id, dst %in% vert$id) |> collect()
   > 
   > nrow(dta)
   [1] 200000000
   > 
   > 
        Command being timed: "R --no-save"
        User time (seconds): 30.19
        System time (seconds): 2.74
        Percent of CPU this job got: 306%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.75
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 7078692
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 443
        Minor (reclaiming a frame) page faults: 61785
        Voluntary context switches: 10143
        Involuntary context switches: 10536
        Swaps: 0
        File system inputs: 3898808
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
   ```
   
   Below the results for v20:
   
   ```
   $ /usr/bin/time -v R --no-save < test.R
   
   R version 4.5.0 (2025-04-11) -- "How About a Twenty-Six"
   Copyright (C) 2025 The R Foundation for Statistical Computing
   Platform: x86_64-pc-linux-gnu
   
   R is free software and comes with ABSOLUTELY NO WARRANTY.
   You are welcome to redistribute it under certain conditions.
   Type 'license()' or 'licence()' for distribution details.
   
     Natural language support but running in an English locale
   
   R is a collaborative project with many contributors.
   Type 'contributors()' for more information and
   'citation()' on how to cite R or R packages in publications.
   
   Type 'demo()' for some demos, 'help()' for on-line help, or
   'help.start()' for an HTML browser interface to help.
   Type 'q()' to quit R.
   
   > 
   > 
   > library(arrow)
   
   Attaching package: ‘arrow’
   
   The following object is masked from ‘package:utils’:
   
       timestamp
   
   > library(dplyr)
   
   Attaching package: ‘dplyr’
   
   The following objects are masked from ‘package:stats’:
   
       filter, lag
   
   The following objects are masked from ‘package:base’:
   
       intersect, setdiff, setequal, union
   
   > 
   > sessionInfo()
   R version 4.5.0 (2025-04-11)
   Platform: x86_64-pc-linux-gnu
   Running under: Debian GNU/Linux 12 (bookworm)
   
   Matrix products: default
   BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
   LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0  LAPACK version 
3.11.0
   
   locale:
    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
    [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8    
    [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8   
    [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
    [9] LC_ADDRESS=C               LC_TELEPHONE=C            
   [11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       
   
   time zone: Europe/Amsterdam
   tzcode source: system (glibc)
   
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base     
   
   other attached packages:
   [1] dplyr_1.1.4  arrow_20.0.0
   
   loaded via a namespace (and not attached):
    [1] assertthat_0.2.1 R6_2.6.1         bit_4.6.0        tidyselect_1.2.1
    [5] magrittr_2.0.3   glue_1.8.0       tibble_3.2.1     pkgconfig_2.0.3 
    [9] bit64_4.6.0-1    generics_0.1.3   lifecycle_1.0.4  cli_3.6.5       
   [13] vctrs_0.6.5      compiler_4.5.0   purrr_1.0.4      pillar_1.10.2   
   [17] rlang_1.1.6     
   > 
   > vert <- read_parquet("vertices.parquet")
   > str(vert)
   tibble [10,000,000 × 1] (S3: tbl_df/tbl/data.frame)
    $ id: int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
   > 
   > con <- open_dataset("edges.parquet")
   > con
   FileSystemDataset with 1 Parquet file
   3 columns
   src: int32
   dst: int32
   type: int32
   
   See $metadata for additional Schema metadata
   > 
   > dta <- con |> filter(src %in% vert$id, dst %in% vert$id) |> collect()
   Command terminated by signal 9
        Command being timed: "R --no-save"
        User time (seconds): 50.57
        System time (seconds): 8.59
        Percent of CPU this job got: 78%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:15.57
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 21869392
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 622
        Minor (reclaiming a frame) page faults: 2185479
        Voluntary context switches: 2431
        Involuntary context switches: 1599
        Swaps: 0
        File system inputs: 140744
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
   ```
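
   For reference, a small sketch that converts the two `Maximum resident set size` values from the logs above into GB and GiB. It assumes that `/usr/bin/time -v` reports kbytes in units of 1024 bytes; the variable names are made up for this illustration. Note that the v20 value is only the peak reached before the process was killed, so it is a lower bound.

   ```r
   # Maximum resident set size from the two /usr/bin/time -v logs, in kbytes
   rss_kb <- c(v17 = 7078692, v20 = 21869392)

   # Assumption: 1 kbyte = 1024 bytes
   rss_gb  <- rss_kb * 1024 / 1e9   # decimal gigabytes
   rss_gib <- rss_kb / 1024^2       # binary gibibytes

   # How many times more memory v20 had claimed when it was killed
   ratio <- rss_kb[["v20"]] / rss_kb[["v17"]]

   print(round(rss_gb, 1))
   print(round(rss_gib, 1))
   print(round(ratio, 1))
   ```

   Under that assumption the peaks are about 7.2 GB for v17 and at least 22.4 GB for v20, a factor of about 3.1.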
   
   So the run with v17 took 10s and a maximum of approx 7GB memory.  The v20 
run was killed after 1m16s because it ran out of memory (my home machine is 
unfortunately limited to 24GB). Before being killed the memory use peaked at 
approx 22GB. 
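
   One possible way to narrow this down (an idea only, not a confirmed workaround and not measured here) would be to express the same filter as two semi joins against an Arrow table instead of `%in%` with a long R vector. The sketch below uses a scaled-down copy of the data written to a temporary file; the sizes and the names `edges_file`, `vert_tbl`, `via_in` and `via_join` are assumptions for illustration only. It only checks that both formulations keep the same number of rows; it says nothing about which one is faster or uses less memory.

   ```r
   library(arrow)
   library(dplyr)

   # Scaled-down data (hypothetical sizes, chosen only so that this runs quickly)
   nvert <- 1e4
   nedge <- 1e5
   vert  <- data.frame(id = seq_len(nvert))
   edges <- data.frame(
       src = sample(nvert, nedge, replace = TRUE),
       dst = sample(nvert, nedge, replace = TRUE),
       type = sample(1:10, nedge, replace = TRUE)
     )
   edges_file <- tempfile(fileext = ".parquet")
   write_parquet(edges, edges_file)

   con <- open_dataset(edges_file)
   vert_tbl <- arrow_table(vert)

   # Formulation from the report: %in% against an R vector
   via_in <- con |> filter(src %in% vert$id, dst %in% vert$id) |> collect()

   # Alternative formulation: semi joins against an Arrow table
   via_join <- con |>
     semi_join(vert_tbl, by = c("src" = "id")) |>
     semi_join(vert_tbl, by = c("dst" = "id")) |>
     collect()

   # Every src and dst is drawn from 1..nvert, so both should keep all rows
   stopifnot(nrow(via_in) == nrow(via_join))
   ```

   Running both formulations on the full-size data under `/usr/bin/time -v` for v17 and v20 would show whether the regression is specific to the `%in%` filter.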
   
   
   
   ### Component(s)
   
   R

