klin333 opened a new issue, #822: URL: https://github.com/apache/arrow-nanoarrow/issues/822
- Creating a data.frame directly from a nanoarrow_array_stream is much slower than first converting to an arrow::Table and then to data.frame.
- I have tried to force ALTREP materialization in both paths for a fair comparison, and confirmed it via `.Internal(inspect(x))`.
- The key observation is that the direct nanoarrow_array_stream → data.frame() path materializes much more slowly than the arrow Table path.
- This is very similar to https://github.com/apache/arrow-adbc/issues/2508; in fact, the workaround of going via an arrow Table is inspired by that issue. However, even though that issue is closed, the reprex below for strings shows that a significant performance difference persists.

The elapsed times to fetch, convert and materialize 100k rows of random strings, for various column counts, are shown below. The route via arrow::Table scales roughly linearly, while the direct route from nanoarrow_array_stream to data.frame scales much worse than linearly.

| num_cols | elapsed_with_arrow | elapsed_without_arrow |
|---------:|-------------------:|----------------------:|
| 10 | 2.4 secs | 1.3 secs |
| 20 | 3.2 secs | 2.9 secs |
| 40 | 6.1 secs | 7.0 secs |
| 80 | 12.7 secs | 30.8 secs |
| 160 | 26.9 secs | **920.7 secs** |

```r
get_con <- function() {
  uri <- "secret"
  adbc_con <- adbcsnowflake::adbcsnowflake() |>
    adbcdrivermanager::adbc_database_init(uri = uri) |>
    adbcdrivermanager::adbc_connection_init()
  adbc_con
}

sweep_cols <- c(10, 20, 40, 80, 160)

results_df <- purrr::map_dfr(sweep_cols, function(num_cols) {
  print(sprintf("num_cols = %i", num_cols))

  # write table to snowflake. use a separate R process to avoid a common string pool
  callr::r(args = list(get_con = get_con, num_cols = num_cols), function(get_con, num_cols) {
    adbc_con <- get_con()
    num_rows <- 1e5
    df <- lapply(seq(num_cols), function(jj) uuid::UUIDgenerate(n = num_rows)) |>
      rlang::set_names(paste0('x', seq(num_cols))) |>
      dplyr::bind_cols()
    lobstr::obj_size(df)
    adbcdrivermanager::execute_adbc(adbc_con, "DROP TABLE IF EXISTS LARGE1;")
    adbcdrivermanager::write_adbc(df, adbc_con, "LARGE1", mode = "default")
    NULL
  })

  # read table with as_arrow_table, use a separate R process to avoid a common string pool
  elapsed_with_arrow <- callr::r(args = list(get_con = get_con), function(get_con) {
    adbc_con <- get_con()
    t1 <- Sys.time()
    df_read <- adbcdrivermanager::read_adbc(adbc_con, 'SELECT * from "LARGE1"') |>
      arrow::as_arrow_table() |>
      as.data.frame()
    .Internal(inspect(df_read[[1]])) # shows: 16 STRSXP g0c0 [REF(65535)] arrow::array_string_vector<0x00000299299da350, string, 687 chunks, 0 nulls> len=100000
    tibble_read <- tibble::as_tibble(df_read) # required in order for print to trigger ALTREP materialization
    .Internal(inspect(df_read[[1]])) # still shows: 16 STRSXP g0c0 [REF(65535)] arrow::array_string_vector<0x00000299299da350, string, 687 chunks, 0 nulls> len=100000
    print(tibble_read) # takes a bit of time. ALTREP materialization happens here
    .Internal(inspect(df_read[[1]])) # shows: 16 STRSXP g1c0 [MARK,REF(65535)] materialized arrow::array_string_vector len=100000
    Sys.time() - t1
  })

  # read table without as_arrow_table, use a separate R process to avoid a common string pool. SLOW
  elapsed_without_arrow <- callr::r(args = list(get_con = get_con), function(get_con) {
    adbc_con <- get_con()
    t1 <- Sys.time()
    df_read <- adbcdrivermanager::read_adbc(adbc_con, 'SELECT * from "LARGE1"') |>
      as.data.frame() # slow: materialization happens here
    .Internal(inspect(df_read[[1]])) # shows: STRSXP g1c7 [MARK,REF(1)] (len=100000, tl=0), i.e. a fully materialized standard R vector
    tibble_read <- tibble::as_tibble(df_read) # instant
    print(tibble_read) # instant
    Sys.time() - t1
  })

  data.frame(num_cols, elapsed_with_arrow, elapsed_without_arrow)
})

results_df
# num_cols elapsed_with_arrow elapsed_without_arrow
# 1       10      2.454586 secs         1.327180 secs
# 2       20      3.211402 secs         2.942045 secs
# 3       40      6.179819 secs         7.045094 secs
# 4       80     12.713749 secs        30.818208 secs
```
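I have not tested whether this reproduces without Snowflake, but given the chunking shown by inspect (687 chunks for 100k rows), a local sketch along these lines, using an in-memory stream of many small batches, may show the same scaling. The use of `basic_array_stream()` and the chunk count/size here are my guesses at mimicking the Snowflake result, not part of the reprex above:

```r
# Untested local sketch: build a stream of many small string batches and
# compare the two conversion paths. ~700 chunks of ~143 rows roughly mimics
# the 687 chunks seen in the inspect output above.
library(nanoarrow)

make_batch <- function(num_cols, num_rows) {
  cols <- replicate(num_cols, as.character(runif(num_rows)), simplify = FALSE)
  names(cols) <- paste0("x", seq_len(num_cols))
  as_nanoarrow_array(as.data.frame(cols))
}

# build a fresh stream each time, since a stream can only be consumed once
make_stream <- function() {
  basic_array_stream(replicate(700, make_batch(160, 143), simplify = FALSE))
}

# direct path (the slow one above)
system.time(df_direct <- as.data.frame(make_stream()))

# via arrow::Table (the workaround); print() forces ALTREP materialization
system.time({
  df_arrow <- as.data.frame(arrow::as_arrow_table(make_stream()))
  print(tibble::as_tibble(df_arrow))
})
```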
```
> library(arrow)
> library(nanoarrow)
> library(adbcdrivermanager)
> library(adbcsnowflake)
> sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default
LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8
[3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Australia.utf8

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] arrow_22.0.0             adbcsnowflake_0.20.0     adbcdrivermanager_0.20.0
[4] nanoarrow_0.7.0-1

loaded via a namespace (and not attached):
 [1] assertthat_0.2.1 R6_2.6.1         bit_4.6.0        tidyselect_1.2.1
 [5] magrittr_2.0.4   glue_1.8.0       tibble_3.3.0     pkgconfig_2.0.3
 [9] bit64_4.6.0-1    dplyr_1.1.4      generics_0.1.4   lifecycle_1.0.4
[13] ps_1.9.1         cli_3.6.5        processx_3.8.6   callr_3.7.6
[17] vctrs_0.6.5      compiler_4.5.2   purrr_1.1.0      pillar_1.11.1
[21] rlang_1.1.6
```

Suggestions from ChatGPT (I have absolutely no idea if they are sensible; a rough sketch of what such a loop could look like follows the list):

- Replicate Arrow's faster materialization algorithm inside nanoarrow (C/C++), focusing on:
  - Chunked processing of record batches (avoid per-row overhead across the whole table).
  - Minimized PROTECT/UNPROTECT frequency (allocate the target STRSXP once per column; protect vectors, not each element).
  - Avoiding repeated re-encoding: call Rf_mkCharLenCE(..., CE_UTF8) on the string buffer/offsets directly; bypass translateCharUTF8 in tight loops.
  - Efficient null handling (bitmap/offset reads with branch-light loops).
  - Optional parallel chunk materialization for very wide tables (behind an option; respecting R's thread-safety constraints).
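To make those bullets concrete, here is a rough sketch of the kind of per-column loop they describe, written against the Arrow C data interface. This is emphatically not nanoarrow's actual converter: `materialize_utf8_column()` and `materialize_utf8_chunk()` are hypothetical names, it assumes plain utf8 arrays (32-bit offsets), and it ignores the large_utf8/string-view layouts and all error handling.

```c
/* Hedged sketch only, not nanoarrow's real converter. Assumes the Arrow C
   data interface layout for a utf8 array: buffers[0] = validity bitmap,
   buffers[1] = int32 offsets, buffers[2] = character data (already UTF-8). */
#include <stdint.h>
#include <Rinternals.h>

/* Arrow C data interface (stable public ABI). */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Copy one utf8 chunk into `out` starting at position `dest`. */
static void materialize_utf8_chunk(SEXP out, R_xlen_t dest,
                                   const struct ArrowArray* a) {
  const uint8_t* validity = (const uint8_t*) a->buffers[0];
  const int32_t* offsets = (const int32_t*) a->buffers[1];
  const char* data = (const char*) a->buffers[2];

  for (int64_t i = 0; i < a->length; i++) {
    int64_t j = a->offset + i;
    if (validity != NULL && !(validity[j / 8] & (1 << (j % 8)))) {
      /* null slot: validity bit is 0 (a NULL bitmap means no nulls) */
      SET_STRING_ELT(out, dest + i, NA_STRING);
    } else {
      /* The buffer is already UTF-8, so mkCharLenCE(CE_UTF8) can read
         straight from it: no translateCharUTF8 round trip. */
      SET_STRING_ELT(out, dest + i,
                     Rf_mkCharLenCE(data + offsets[j],
                                    (int)(offsets[j + 1] - offsets[j]),
                                    CE_UTF8));
    }
  }
}

/* Materialize a whole chunked utf8 column into one character vector:
   one allocation and one PROTECT for the column, then a tight loop
   per chunk rather than per-element bookkeeping. */
SEXP materialize_utf8_column(struct ArrowArray** chunks, int64_t n_chunks) {
  int64_t total = 0;
  for (int64_t c = 0; c < n_chunks; c++) total += chunks[c]->length;

  SEXP out = PROTECT(Rf_allocVector(STRSXP, (R_xlen_t) total));
  R_xlen_t dest = 0;
  for (int64_t c = 0; c < n_chunks; c++) {
    materialize_utf8_chunk(out, dest, chunks[c]);
    dest += (R_xlen_t) chunks[c]->length;
  }

  UNPROTECT(1);
  return out;
}
```

The shape is the point of the suggestions above: a single STRSXP allocated and protected per column, nulls read straight from the validity bitmap, and Rf_mkCharLenCE fed directly from the UTF-8 buffer. Whether the gap in the reprex actually comes from per-element overhead, or from something else entirely, would need profiling.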
