klin333 opened a new issue, #822: URL: https://github.com/apache/arrow-nanoarrow/issues/822
- Creating a data.frame directly from a nanoarrow_array_stream is much slower than first converting to an arrow::Table and then to data.frame.
- I have tried to force ALTREP materialization in both paths for a fair comparison, and confirmed it via `.Internal(inspect(x))`.
- The key observation is that the direct nanoarrow_array_stream → data.frame() path materializes much more slowly than the arrow Table path.
- This is very similar to https://github.com/apache/arrow-adbc/issues/2508; in fact, the workaround of going via an arrow Table is inspired by that issue. However, even though that issue is closed, the reprex below for strings shows that a significant performance difference persists.

The elapsed times to fetch, convert and materialize 100k rows of random strings, for various column counts, are shown below. The route via arrow::Table scales roughly linearly, while the direct route from nanoarrow_array_stream to data.frame scales much worse than linearly.

| num_cols | elapsed_with_arrow | elapsed_without_arrow |
|---------:|-------------------:|----------------------:|
| 10 | 2.4 secs | 1.3 secs |
| 20 | 3.2 secs | 2.9 secs |
| 40 | 6.1 secs | 7.0 secs |
| 80 | 12.7 secs | 30.8 secs |
| 160 | 26.9 secs | **920.7 secs** |

```r
get_con <- function() {
  uri <- "secret"
  adbc_con <- adbcsnowflake::adbcsnowflake() |>
    adbcdrivermanager::adbc_database_init(uri = uri) |>
    adbcdrivermanager::adbc_connection_init()
  adbc_con
}

sweep_cols <- c(10, 20, 40, 80, 160)

results_df <- purrr::map_dfr(sweep_cols, function(num_cols) {
  print(sprintf("num_cols = %i", num_cols))

  # write table to snowflake. use a separate R process to avoid a common string pool
  callr::r(args = list(get_con = get_con, num_cols = num_cols), function(get_con, num_cols) {
    adbc_con <- get_con()
    num_rows <- 1e5
    df <- lapply(seq(num_cols), function(jj) uuid::UUIDgenerate(n = num_rows)) |>
      rlang::set_names(paste0('x', seq(num_cols))) |>
      dplyr::bind_cols()
    lobstr::obj_size(df)
    adbcdrivermanager::execute_adbc(adbc_con, "DROP TABLE IF EXISTS LARGE1;")
    adbcdrivermanager::write_adbc(df, adbc_con, "LARGE1", mode = "default")
    NULL
  })

  # read table with as_arrow_table, use a separate R process to avoid a common string pool
  elapsed_with_arrow <- callr::r(args = list(get_con = get_con), function(get_con) {
    adbc_con <- get_con()
    t1 <- Sys.time()
    df_read <- adbcdrivermanager::read_adbc(adbc_con, 'SELECT * from "LARGE1"') |>
      arrow::as_arrow_table() |>
      as.data.frame()
    .Internal(inspect(df_read[[1]])) # shows: 16 STRSXP g0c0 [REF(65535)] arrow::array_string_vector<0x00000299299da350, string, 687 chunks, 0 nulls> len=100000
    tibble_read <- tibble::as_tibble(df_read) # required in order for print to trigger ALTREP materialization
    .Internal(inspect(df_read[[1]])) # still shows: 16 STRSXP g0c0 [REF(65535)] arrow::array_string_vector<0x00000299299da350, string, 687 chunks, 0 nulls> len=100000
    print(tibble_read) # takes a bit of time. ALTREP materialization happens here
    .Internal(inspect(df_read[[1]])) # shows: 16 STRSXP g1c0 [MARK,REF(65535)] materialized arrow::array_string_vector len=100000
    Sys.time() - t1
  })

  # read table without as_arrow_table, use a separate R process to avoid a common string pool. SLOW
  elapsed_without_arrow <- callr::r(args = list(get_con = get_con), function(get_con) {
    adbc_con <- get_con()
    t1 <- Sys.time()
    df_read <- adbcdrivermanager::read_adbc(adbc_con, 'SELECT * from "LARGE1"') |>
      as.data.frame() # slow: materialization happens here
    .Internal(inspect(df_read[[1]])) # shows: STRSXP g1c7 [MARK,REF(1)] (len=100000, tl=0), i.e. a fully materialized standard R vector
    tibble_read <- tibble::as_tibble(df_read) # instant
    print(tibble_read) # instant
    Sys.time() - t1
  })

  data.frame(num_cols, elapsed_with_arrow, elapsed_without_arrow)
})

results_df
# num_cols elapsed_with_arrow elapsed_without_arrow
# 1       10      2.454586 secs         1.327180 secs
# 2       20      3.211402 secs         2.942045 secs
# 3       40      6.179819 secs         7.045094 secs
# 4       80     12.713749 secs        30.818208 secs
```
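I have not tested whether this reproduces without Snowflake, but given the chunking shown by inspect (687 chunks for 100k rows), a local sketch along these lines, using an in-memory stream of many small batches, may show the same scaling. The use of `basic_array_stream()` and the chunk count/size here are my guesses at mimicking the Snowflake result, not part of the reprex above:

```r
# Untested local sketch: build a stream of many small string batches and
# compare the two conversion paths. ~700 chunks of ~143 rows roughly mimics
# the 687 chunks seen in the inspect output above.
library(nanoarrow)

make_batch <- function(num_cols, num_rows) {
  cols <- replicate(num_cols, as.character(runif(num_rows)), simplify = FALSE)
  names(cols) <- paste0("x", seq_len(num_cols))
  as_nanoarrow_array(as.data.frame(cols))
}

# build a fresh stream each time, since a stream can only be consumed once
make_stream <- function() {
  basic_array_stream(replicate(700, make_batch(160, 143), simplify = FALSE))
}

# direct path (the slow one above)
system.time(df_direct <- as.data.frame(make_stream()))

# via arrow::Table (the workaround); print() forces ALTREP materialization
system.time({
  df_arrow <- as.data.frame(arrow::as_arrow_table(make_stream()))
  print(tibble::as_tibble(df_arrow))
})
```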
```
> library(arrow)
> library(nanoarrow)
> library(adbcdrivermanager)
> library(adbcsnowflake)
> sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default
LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8
[3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Australia.utf8

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] arrow_22.0.0             adbcsnowflake_0.20.0     adbcdrivermanager_0.20.0
[4] nanoarrow_0.7.0-1

loaded via a namespace (and not attached):
 [1] assertthat_0.2.1 R6_2.6.1         bit_4.6.0        tidyselect_1.2.1
 [5] magrittr_2.0.4   glue_1.8.0       tibble_3.3.0     pkgconfig_2.0.3
 [9] bit64_4.6.0-1    dplyr_1.1.4      generics_0.1.4   lifecycle_1.0.4
[13] ps_1.9.1         cli_3.6.5        processx_3.8.6   callr_3.7.6
[17] vctrs_0.6.5      compiler_4.5.2   purrr_1.1.0      pillar_1.11.1
[21] rlang_1.1.6
```

Suggestions from ChatGPT (I have absolutely no idea if they are sensible; a rough sketch of what such a loop could look like follows the list):

- Replicate Arrow's faster materialization algorithm inside nanoarrow (C/C++), focusing on:
  - Chunked processing of record batches (avoid per-row overhead across the whole table).
  - Minimized PROTECT/UNPROTECT frequency (allocate the target STRSXP once per column; protect vectors, not each element).
  - Avoiding repeated re-encoding: call Rf_mkCharLenCE(..., CE_UTF8) on the string buffer/offsets directly; bypass translateCharUTF8 in tight loops.
  - Efficient null handling (bitmap/offset reads with branch-light loops).
  - Optional parallel chunk materialization for very wide tables (behind an option; respecting R's thread-safety constraints).
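To make those bullets concrete, here is a rough sketch of the kind of per-column loop they describe, written against the Arrow C data interface. This is emphatically not nanoarrow's actual converter: `materialize_utf8_column()` and `materialize_utf8_chunk()` are hypothetical names, it assumes plain utf8 arrays (32-bit offsets), and it ignores the large_utf8/string-view layouts and all error handling.

```c
/* Hedged sketch only, not nanoarrow's real converter. Assumes the Arrow C
   data interface layout for a utf8 array: buffers[0] = validity bitmap,
   buffers[1] = int32 offsets, buffers[2] = character data (already UTF-8). */
#include <stdint.h>
#include <Rinternals.h>

/* Arrow C data interface (stable public ABI). */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Copy one utf8 chunk into `out` starting at position `dest`. */
static void materialize_utf8_chunk(SEXP out, R_xlen_t dest,
                                   const struct ArrowArray* a) {
  const uint8_t* validity = (const uint8_t*) a->buffers[0];
  const int32_t* offsets = (const int32_t*) a->buffers[1];
  const char* data = (const char*) a->buffers[2];

  for (int64_t i = 0; i < a->length; i++) {
    int64_t j = a->offset + i;
    if (validity != NULL && !(validity[j / 8] & (1 << (j % 8)))) {
      /* null slot: validity bit is 0 (a NULL bitmap means no nulls) */
      SET_STRING_ELT(out, dest + i, NA_STRING);
    } else {
      /* The buffer is already UTF-8, so mkCharLenCE(CE_UTF8) can read
         straight from it: no translateCharUTF8 round trip. */
      SET_STRING_ELT(out, dest + i,
                     Rf_mkCharLenCE(data + offsets[j],
                                    (int)(offsets[j + 1] - offsets[j]),
                                    CE_UTF8));
    }
  }
}

/* Materialize a whole chunked utf8 column into one character vector:
   one allocation and one PROTECT for the column, then a tight loop
   per chunk rather than per-element bookkeeping. */
SEXP materialize_utf8_column(struct ArrowArray** chunks, int64_t n_chunks) {
  int64_t total = 0;
  for (int64_t c = 0; c < n_chunks; c++) total += chunks[c]->length;

  SEXP out = PROTECT(Rf_allocVector(STRSXP, (R_xlen_t) total));
  R_xlen_t dest = 0;
  for (int64_t c = 0; c < n_chunks; c++) {
    materialize_utf8_chunk(out, dest, chunks[c]);
    dest += (R_xlen_t) chunks[c]->length;
  }

  UNPROTECT(1);
  return out;
}
```

The shape is the point of the suggestions above: a single STRSXP allocated and protected per column, nulls read straight from the validity bitmap, and Rf_mkCharLenCE fed directly from the UTF-8 buffer. Whether the gap in the reprex actually comes from per-element overhead, or from something else entirely, would need profiling.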
