rafapereirabr opened a new issue, #45872:
URL: https://github.com/apache/arrow/issues/45872

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   # Problem
   
   I'm using {arrow} as a dependency in my package
   [{geocodebr}](https://ipeagit.github.io/geocodebr/). In one particular case, I
   use {arrow} to keep only the distinct rows of a large table that originally has
   `50,240,061` rows. When I run the code in the reprex (see below), it crashes
   RStudio.
   
   When I use this code inside my package, though, it throws the following
   error message, which seems to come from the C++ layer:
   
   > terminate called after throwing an instance of 'std::length_error'
     what():  vector::_M_default_append
   
   PS: I just wanted to add that {arrow} is an incredible package and that the
   R community really appreciates your work on it! Thanks!
   
   # Reprex
   ```
   remotes::install_github("ipeaGIT/geocodebr")
   
   library(geocodebr)
   library(dplyr)
   library(arrow)
   
   # download parquet file
   geocodebr::download_cnefe(
     tabela = 'municipio_logradouro_numero_cep_localidade'
   )
   
   path_to_parquet <- geocodebr::listar_dados_cache()[1]
   
   key_cols <- c("estado", "municipio", "logradouro", "numero", "cep" )
   
   unique_logradouros <- arrow::open_dataset(path_to_parquet) |>
     dplyr::select(dplyr::all_of(key_cols)) |>
     dplyr::distinct() |>
     dplyr::compute()
   
   
   ## remove files from cache
   # geocodebr::deletar_pasta_cache()
   ```
   
   I'm running this on a Windows machine with 250GB of RAM.
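
   For reference, here is a scaled-down sketch of the same pipeline on a small
   in-memory table. The toy data frame and its values are made up for
   illustration (they are not taken from the CNEFE file, and the real column
   types may differ), and I would not expect this small version to trigger the
   crash. It only shows the operation the reprex performs: select the key
   columns, keep the distinct rows, and materialize the result.

   ```r
   library(arrow)
   library(dplyr)

   # Toy stand-in for the CNEFE table: made-up values, with one duplicated row
   toy <- data.frame(
     estado     = c("SP", "SP", "RJ"),
     municipio  = c("Sao Paulo", "Sao Paulo", "Rio de Janeiro"),
     logradouro = c("Rua A", "Rua A", "Rua B"),
     numero     = c(10L, 10L, 20L),
     cep        = c("01000-000", "01000-000", "20000-000"),
     extra      = c(1, 2, 3)  # non-key column, dropped by select()
   )

   key_cols <- c("estado", "municipio", "logradouro", "numero", "cep")

   # Same verbs as in the reprex, applied to an in-memory Arrow Table
   unique_toy <- arrow::arrow_table(toy) |>
     dplyr::select(dplyr::all_of(key_cols)) |>
     dplyr::distinct() |>
     dplyr::compute()

   # Number of distinct key combinations
   nrow(unique_toy)
   ```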
   
   # Environment
   
   I'm using the latest version of {arrow} on Windows. See below.
   
   ```
   > sessionInfo()
   R version 4.3.2 (2023-10-31 ucrt)
   Platform: x86_64-w64-mingw32/x64 (64-bit)
   Running under: Windows Server 2022 x64 (build 20348)
   
   Matrix products: default
   
   
   locale:
   [1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8
   [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
   [5] LC_TIME=English_United States.utf8
   
   time zone: America/Sao_Paulo
   tzcode source: internal
   
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base     
   
   other attached packages:
   [1] arrow_19.0.1    dplyr_1.1.4     geocodebr_0.2.0 testthat_3.2.3 
   
   loaded via a namespace (and not attached):
    [1] rappdirs_0.3.3     generics_0.1.3     class_7.3-22       KernSmooth_2.23-22 digest_0.6.37
    [6] magrittr_2.0.3     grid_4.3.2         nanoarrow_0.6.0    pkgload_1.4.0      fastmap_1.2.0
   [11] rprojroot_2.0.4    pkgbuild_1.4.6     sessioninfo_1.2.3  e1071_1.7-16       backports_1.5.0
   [16] brio_1.1.5         DBI_1.2.3          urlchecker_1.0.1   promises_1.3.2     purrr_1.0.4
   [21] httr2_1.1.1        duckdb_1.2.0       cli_3.6.4          shiny_1.10.0       rlang_1.1.5
   [26] units_0.8-7        ellipsis_0.3.2     bit64_4.6.0-1      remotes_2.5.0      withr_3.0.2
   [31] cachem_1.1.0       devtools_2.4.5     parallel_4.3.2     tools_4.3.2        sfheaders_0.4.4
   [36] memoise_2.0.1      checkmate_2.3.2    httpuv_1.6.15      curl_6.2.1         assertthat_0.2.1
   [41] vctrs_0.6.5        R6_2.6.1           mime_0.12          proxy_0.4-27       classInt_0.4-11
   [46] lifecycle_1.0.4    bit_4.6.0          fs_1.6.5           htmlwidgets_1.6.4  usethis_3.1.0
   [51] miniUI_0.1.1.1     pkgconfig_2.0.3    desc_1.4.3         pillar_1.10.1      later_1.4.1
   [56] data.table_1.17.0  glue_1.8.0         profvis_0.4.0      Rcpp_1.0.14        sf_1.0-19
   [61] tibble_3.2.1       tidyselect_1.2.1   rstudioapi_0.17.1  xtable_1.8-4       htmltools_0.5.8.1
   [66] compiler_4.3.2     enderecobr_0.4.1
   ```
   
   ### Component(s)
   
   R, C++

