rafapereirabr opened a new issue, #45872: URL: https://github.com/apache/arrow/issues/45872
### Describe the bug, including details regarding any error messages, version, and platform.

# Problem

I'm using {arrow} as a dependency in my package [{geocodebr}](https://ipeagit.github.io/geocodebr/). In one particular case, I use {arrow} to keep only the distinct rows of a large table that originally has `50,240,061` rows. When I run the code of the reprex (see below), it crashes RStudio. When I use this code inside my package, though, it throws the following error message, which seems to come from an error in C++:

> terminate called after throwing an instance of 'std::length_error'
> what(): vector::_M_default_append

P.S. I just wanted to add that {arrow} is an incredible package and that the R community really appreciates your work on it! Thanks!

# Reprex

```
remotes::install_github("ipeaGIT/geocodebr")
library(geocodebr)
library(dplyr)
library(arrow)

# download parquet file
geocodebr::download_cnefe(tabela = 'municipio_logradouro_numero_cep_localidade')

path_to_parquet <- geocodebr::listar_dados_cache()[1]

key_cols <- c("estado", "municipio", "logradouro", "numero", "cep")

unique_logradouros <- arrow::open_dataset(path_to_parquet) |>
  dplyr::select(dplyr::all_of(key_cols)) |>
  dplyr::distinct() |>
  dplyr::compute()

## remove files from cache
# geocodebr::deletar_pasta_cache()
```

I'm running this on a Windows machine with 250GB of RAM.

# Environment

I'm using the latest version of {arrow} on Windows. See below.
```
> sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2022 x64 (build 20348)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arrow_19.0.1 dplyr_1.1.4 geocodebr_0.2.0 testthat_3.2.3

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3 generics_0.1.3 class_7.3-22 KernSmooth_2.23-22 digest_0.6.37
 [6] magrittr_2.0.3 grid_4.3.2 nanoarrow_0.6.0 pkgload_1.4.0 fastmap_1.2.0
[11] rprojroot_2.0.4 pkgbuild_1.4.6 sessioninfo_1.2.3 e1071_1.7-16 backports_1.5.0
[16] brio_1.1.5 DBI_1.2.3 urlchecker_1.0.1 promises_1.3.2 purrr_1.0.4
[21] httr2_1.1.1 duckdb_1.2.0 cli_3.6.4 shiny_1.10.0 rlang_1.1.5
[26] units_0.8-7 ellipsis_0.3.2 bit64_4.6.0-1 remotes_2.5.0 withr_3.0.2
[31] cachem_1.1.0 devtools_2.4.5 parallel_4.3.2 tools_4.3.2 sfheaders_0.4.4
[36] memoise_2.0.1 checkmate_2.3.2 httpuv_1.6.15 curl_6.2.1 assertthat_0.2.1
[41] vctrs_0.6.5 R6_2.6.1 mime_0.12 proxy_0.4-27 classInt_0.4-11
[46] lifecycle_1.0.4 bit_4.6.0 fs_1.6.5 htmlwidgets_1.6.4 usethis_3.1.0
[51] miniUI_0.1.1.1 pkgconfig_2.0.3 desc_1.4.3 pillar_1.10.1 later_1.4.1
[56] data.table_1.17.0 glue_1.8.0 profvis_0.4.0 Rcpp_1.0.14 sf_1.0-19
[61] tibble_3.2.1 tidyselect_1.2.1 rstudioapi_0.17.1 xtable_1.8-4 htmltools_0.5.8.1
[66] compiler_4.3.2 enderecobr_0.4.1
```

### Component(s)

R, C++

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org