[I] R: Benchmarking ADBC (Snowflake) vs ODBC shows issue with larger datasets [arrow-adbc]

via GitHub Sun, 09 Feb 2025 19:15:21 -0800


JavOrraca opened a new issue, #2508:
URL: https://github.com/apache/arrow-adbc/issues/2508


   ### What would you like help with?
   
   After connecting to Snowflake with R's v0.16.0 
[`adbcsnowflake`](https://arrow.apache.org/adbc/current/r/adbcsnowflake/index.html)
 and 
[`adbcdrivermanager`](https://arrow.apache.org/adbc/current/r/adbcdrivermanager/))
 I benchmarked query + download times using ADBC vs ODBC.
   
   This could be a case of user error or bad setup. My benchmarking (see 
screenshot below) tested ADBC vs ODBC on the same observations and columns, 
iterating over different combinations of row counts and columns.
   
   
![Image](https://github.com/user-attachments/assets/85fdb643-b950-404f-9d12-6acaf896fd83)
   
   For the ODBC method, I established a connection to Snowflake with `DBI` and 
then used `dplyr` + `dbplyr` to interact with Snowflake, finally collecting the 
results as a tibble via `dplyr::collect()`. For the ADBC method, I used the 
connection string below and collected my results to a tibble using 
`tibble::as_tibble()` as shown on the 
[`adbcsnowflake`](https://arrow.apache.org/adbc/current/r/adbcsnowflake/index.html)
 website.
   
   ```{r}
   # Not shown here is the ODBC connection string using {DBI} but
   # it relies on the same inputs as the ADBC inputs, below, and the
   # ODBC connection was assigned to the object "conn_odbc"
   
   library(adbcdrivermanager)
   
   db <- adbc_database_init(
     adbcsnowflake::adbcsnowflake(),
     username = Sys.getenv("ldap_username"),
     adbc.snowflake.sql.account = Sys.getenv("sf_account"),
     adbc.snowflake.sql.uri.protocol = "https",
     adbc.snowflake.sql.uri.host = Sys.getenv("sf_host"),
     adbc.snowflake.sql.uri.port = "443",
     adbc.snowflake.sql.auth_type = "auth_oauth",
     adbc.snowflake.sql.client_option.auth_token = Sys.getenv("sf_auth_token"),
     adbc.snowflake.sql.role = Sys.getenv("sf_role"),
     adbc.snowflake.sql.warehouse = Sys.getenv("sf_warehouse"),
     adbc.snowflake.sql.database = Sys.getenv("sf_database")
   )
   
   conn_adbc <- adbc_connection_init(db)
   
   # Benchmarking psuedo-code example
   
   library(bench)
   library(dplyr)
   library(dbplyr)
   
   benchmarking_odbc_vs_adbc <- bench::mark(
     min iterations = 5,
     ODBC_1000_5_col = conn_odbc |>
       dplyr::tbl(I("CATALOG.SCHEMA.TABLE_NAME")) |>
       dplyr::select(name, age, tenure, year_yyyy, month_mm) |>
       dplyr::filter(
         year_yyyy == 2024,
         month_mm == 12
       ) |>
       head(1000) |>
       dplyr::collect(),
     ADBC_1000_5_col = conn_adbc |>
       adbcdrivermanager::read_adbc(
         "SELECT TOP 1000 col_1, col_2, col_3, col_4, col_5
          FROM CATALOG.SCHEMA.TABLE_NAME
          WHERE year_yyyy = 2024
          AND month_mm = 12"
       ) |>
       tibble::as_tibble(),
     check = FALSE
   )
   ```
   
   As indicated on my screenshot above, my results show that ADBC performs 
better than ODBC on the smaller data sets but on the largest of data pulls (1 
million rows and 75 columns), ODBC performs better. I ran 5 iterations for each 
combination of rows and columns and forn the largest data set, ADBC takes ~57 
minutes to pull the data while ODBC takes ~50 seconds. There could be a number 
of reasons for this (e.g., user error, or driver config setup, or something 
even Snowflake side) but if anything on the surface jumps out at what I should 
be doing differently, any pointers would be appreciated.
   
   Thanks so much for your help and awesome work with Arrow and these drivers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[I] R: Benchmarking ADBC (Snowflake) vs ODBC shows issue with larger datasets [arrow-adbc]

Reply via email to