mrd0ll4r opened a new issue, #46983:
URL: https://github.com/apache/arrow/issues/46983

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Hello, it's me again :)
   
   Not with a big-data crash today, but a different oddity.
   I have a large-ish parquet dataset, which I process with multiple dplyr 
verbs, including `inner_join`s.
   I used to do this in "plain" Arrow, but wanted to try `to_duckdb` for 
performance (and also because it's cool).
   I join the dataset to a list of characters and later to a tibble.
   `join_by` complains if the backends are not the same (I could provide 
`copy=TRUE`), so I decided to move those to DuckDB, too.
   
   However, I noticed that when I execute multiple queries with `inner_join(foo 
%>% to_duckdb(), ...)`s,
   DuckDB sometimes doesn't find the referenced table. It's flaky.
   
   This is the error:
   ```R
   Error in `collect()`:
   ! Failed to collect lazy table.
   Caused by error in `dbSendQuery()`:
   ! rapi_prepare: Failed to prepare query SELECT reason, SUM(n) AS n
   FROM (
     SELECT LHS.*, reason
     FROM (
       SELECT cid, COUNT(*) AS n
       FROM (
         SELECT arrow_107.*
         FROM arrow_107
         INNER JOIN arrow_108
           ON (arrow_107.cid = arrow_108.cid)
       ) q01
       GROUP BY cid
     ) LHS
     LEFT JOIN arrow_109
       ON (LHS.cid = arrow_109.cid)
   ) q01
   GROUP BY reason
   Error: Catalog Error: Table with name arrow_108 does not exist!
   Did you mean "pg_proc"?
   
   LINE 9:       INNER JOIN arrow_108
                            ^
   Run `rlang::last_trace()` to see where the error occurred.
   > rlang::last_trace()
   <error/rlang_error>
   Error in `collect()`:
   ! Failed to collect lazy table.
   Caused by error in `dbSendQuery()`:
   ! rapi_prepare: Failed to prepare query SELECT reason, SUM(n) AS n
   FROM (
     SELECT LHS.*, reason
     FROM (
       SELECT cid, COUNT(*) AS n
       FROM (
         SELECT arrow_107.*
         FROM arrow_107
         INNER JOIN arrow_108
           ON (arrow_107.cid = arrow_108.cid)
       ) q01
       GROUP BY cid
     ) LHS
     LEFT JOIN arrow_109
       ON (LHS.cid = arrow_109.cid)
   ) q01
   GROUP BY reason
   Error: Catalog Error: Table with name arrow_108 does not exist!
   Did you mean "pg_proc"?
   
   LINE 9:       INNER JOIN arrow_108
                            ^
   ---
   Backtrace:
       ▆
    1. ├─... %>% collect()
    2. ├─dplyr::collect(.)
    3. └─dbplyr:::collect.tbl_sql(.)
    4.   ├─base::withCallingHandlers(...)
    5.   ├─dbplyr::db_collect(...)
    6.   └─dbplyr:::db_collect.DBIConnection(...)
    7.     ├─DBI::dbSendQuery(con, sql)
    8.     └─duckdb::dbSendQuery(con, sql)
   ```
   
   The particularly weird thing is that it usually works the first time,
   but fails when I execute it multiple times one after the other.
   Is there some GC mechanism at work maybe that kicks in during 
preparation/execution of the second run?
   
   Reproducer:
   ```R
   
   # Probably unrelated, but this is how Arrow was compiled:
   Sys.setenv("ARROW_WITH_ZLIB"="ON")
   Sys.setenv("ARROW_PARQUET" = "ON")
   Sys.setenv("ARROW_WITH_ZSTD" = "ON")
   Sys.setenv("ARROW_MIMALLOC" = "ON")
   Sys.setenv("LIBARROW_MINIMAL" = FALSE)
   Sys.setenv("LIBARROW_BINARY" = FALSE)
   Sys.setenv("ARROW_R_DEV" = TRUE)
   Sys.setenv(MAKEFLAGS = sprintf("-j%d", parallel::detectCores()))
   options(renv.config.pak.enabled = TRUE)
   
   # Dependencies:
   library(arrow)
   library(dplyr)
   # I might have forgotten a package here, this is part of a larger project
   
   foo1 <- arrow_table(cid = c("a","b","c", "d", "a"), peer = 
c("1","2","3","2","2"))
   foo2 <- c("b","c")
   foo3 <- tibble(cid=c("a","b","c","d"), reason=c("something","something 
else","something else","something"))
   
   # This fails most of the time.
   # In particular if you execute it multiple times quickly in succession.
   foo1 %>%
     to_duckdb() %>%
     inner_join(arrow_table(tibble(cid=foo2)) %>% to_duckdb(), by=join_by(cid)) 
%>%
     group_by(cid) %>%
     tally() %>%
     left_join(arrow_table(foo3) %>% to_duckdb(), by=join_by(cid)) %>%
     group_by(reason) %>%
     summarize(n=sum(n)) %>%
     collect()
   
   # This works more often, but still fails sometimes:
   foo1 %>%
     to_duckdb() %>%
     inner_join(arrow_table(tibble(cid=foo2)) %>% to_duckdb(), by=join_by(cid)) 
%>%
     group_by(cid) %>%
     tally() %>%
     left_join(arrow_table(foo3) %>% to_duckdb(), by=join_by(cid)) %>%
     collect()
   
   # Pulling out the inline to_duckdb calls makes it so that it seemingly 
always works:
   tmp1 <- arrow_table(tibble(cid=foo2)) %>% to_duckdb()
   tmp2 <- arrow_table(foo3) %>% to_duckdb()
   
   foo1 %>%
     to_duckdb() %>%
     inner_join(tmp1, by=join_by(cid)) %>%
     group_by(cid) %>%
     tally() %>%
     left_join(tmp2, by=join_by(cid)) %>%
     group_by(reason) %>%
     summarize(n=sum(n)) %>%
     collect()
   ```
   
   
   ### Additional Info
   
   #### Machine Overview:
   ```
   Memory: 378 GB
   CPU: 64x Intel(R) Xeon(R) Gold 6154
   OS: Debian 12
   ```
   
   #### R `sessionInfo()`
   ```R
   > sessionInfo()
   R version 4.5.0 (2025-04-11)
   Platform: x86_64-pc-linux-gnu
   Running under: Debian GNU/Linux 12 (bookworm)
   
   Matrix products: default
   BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
   LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0  LAPACK version 
3.11.0
   
   locale:
    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               
LC_TIME=en_US.UTF-8       
    [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    
LC_MESSAGES=en_US.UTF-8   
    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C      
        
   [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 
LC_IDENTIFICATION=C       
   
   time zone: Europe/Berlin
   tzcode source: system (glibc)
   
   attached base packages:
   [1] stats     graphics  grDevices datasets  utils     methods   base     
   
   other attached packages:
    [1] treemapify_2.5.6  paletteer_1.6.0   ggplot2_3.5.2     viridis_0.6.5    
    [5] viridisLite_0.4.2 pracma_2.4.4      xtable_1.8-4      forcats_1.0.0    
    [9] readr_2.1.5       arrow_20.0.0      tidyr_1.3.1       stringr_1.5.1    
   [13] lubridate_1.9.4   dplyr_1.1.4      
   
   loaded via a namespace (and not attached):
    [1] utf8_1.2.6         generics_0.1.4     renv_1.0.3         stringi_1.8.7  
   
    [5] hms_1.1.3          magrittr_2.0.3     grid_4.5.0         
timechange_0.3.0  
    [9] RColorBrewer_1.1-3 blob_1.2.4         DBI_1.2.3          rematch2_2.1.2 
   
   [13] gridExtra_2.3      purrr_1.0.4        scales_1.4.0       duckdb_1.3.0   
   
   [17] cli_3.6.5          rlang_1.1.6        crayon_1.5.3       dbplyr_2.5.0   
   
   [21] bit64_4.6.0-1      withr_3.0.2        tools_4.5.0        parallel_4.5.0 
   
   [25] tzdb_0.5.0         assertthat_0.2.1   vctrs_0.6.5        R6_2.6.1       
   
   [29] lifecycle_1.0.4    bit_4.6.0          vroom_1.6.5        
pkgconfig_2.0.3   
   [33] pillar_1.10.2      gtable_0.3.6       glue_1.8.0         
ggfittext_0.10.2  
   [37] tibble_3.3.0       tidyselect_1.2.1   rstudioapi_0.17.1  farver_2.1.2   
   
   [41] compiler_4.5.0  
   ```
   
   #### `lsb_release -a`
   ```
   No LSB modules are available.
   Distributor ID: Debian
   Description:    Debian GNU/Linux 12 (bookworm)
   Release:        12
   Codename:       bookworm
   ```
   
   #### `uname -a`
   ```
   Linux <redacted> 6.1.0-34-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.135-1 
(2025-04-25) x86_64 GNU/Linux
   ```
   
   #### `cat /proc/cpuinfo` (truncated)
   ```
   processor       : 0
   vendor_id       : GenuineIntel
   cpu family      : 6
   model           : 85
   model name      : Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
   stepping        : 4
   microcode       : 0x2007108
   cpu MHz         : 2992.968
   cache size      : 16384 KB
   physical id     : 0
   siblings        : 64
   core id         : 0
   cpu cores       : 64
   apicid          : 0
   initial apicid  : 0
   fpu             : yes
   fpu_exception   : yes
   cpuid level     : 22
   wp              : yes
   flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm 
constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni 
pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 
3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow 
vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 
erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd 
avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke md_clear 
flush_l1d arch_capabilities
   vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only 
ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid 
unrestricted_guest vapic_reg vid shadow_vmcs pml tsc_scaling
   bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf 
mds swapgs taa mmio_stale_data retbleed gds bhi ibpb_no_ret
   bogomips        : 5985.93
   clflush size    : 64
   cache_alignment : 64
   address sizes   : 46 bits physical, 48 bits virtual
   power management:
   ```
   
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to