r2evans opened a new issue, #45373:
URL: https://github.com/apache/arrow/issues/45373

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
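   I think there's a bug in how a lazy pipeline handles an `arrange()` that is followed by a `summarize()`.
   
   If there's an `arrange(.)` in the lazy pipeline that is followed by some aggregation with `summarize`, the collection still looks for the sorting column, even though the aggregation has already dropped it: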
   
   ```r
   library(arrow)
   library(dplyr)
   arrow_table(mtcars) |>
     summarize(across(mpg, list(Min = min, Max = max))) |>
     collect()
   # # A tibble: 1 × 2
   #   mpg_Min mpg_Max
   #     <dbl>   <dbl>
   # 1    10.4    33.9
   
   arrow_table(mtcars) |>
     arrange(mpg) |>
     summarize(across(mpg, list(Min = min, Max = max))) |>
     collect()
   # Error in compute.arrow_dplyr_query(x) : 
   #   Invalid: Invalid sort key column: No match for FieldRef.Name(mpg) in mpg_Min: double
   # mpg_Max: double
   # ----
   # mpg_Min:
   #   [
   #     [
   #       10.4
   #     ]
   #   ]
   # mpg_Max:
   #   [
   #     [
   #       33.9
   #     ]
   #   ]
   ```
   
   
   This example is somewhat contrived _here_, in that this summarization does not need ordered data. The underlying issue remains: why does it not sort the data _at that point_ and then summarize? I'm not certain whether this is a problem with lazy sorting or whether it is too aggressive in preserving the sort field(s) after the column they refer to is gone.
   
   This behavior contrasts with a `select()` that removes the sorting column, which collects without error:
   
   ```r
   arrow_table(mtcars) |>
     arrange(mpg) |>
     select(-mpg) |>
     collect()
   # # A tibble: 32 × 10
   #      cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   #    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
   #  1     8  472    205  2.93  5.25  18.0     0     0     3     4
   #  2     8  460    215  3     5.42  17.8     0     0     3     4
   #  3     8  350    245  3.73  3.84  15.4     0     0     3     4
   #  4     8  360    245  3.21  3.57  15.8     0     0     3     4
   #  5     8  440    230  3.23  5.34  17.4     0     0     3     4
   #  6     8  301    335  3.54  3.57  14.6     0     1     5     8
   #  7     8  276.   180  3.07  3.78  18       0     0     3     3
   #  8     8  304    150  3.15  3.44  17.3     0     0     3     2
   #  9     8  318    150  2.76  3.52  16.9     0     0     3     2
   # 10     8  351    264  4.22  3.17  14.5     0     1     5     4
   # # ℹ 22 more rows
   # # ℹ Use `print(n = ...)` to see more rows
   ```
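   
   As a stopgap, one way to get "sort at that point, then summarize" is to materialize the sorted table before aggregating. The sketch below assumes that `collect()` runs the pending sort and returns a plain tibble, so the `summarize()` that follows is ordinary dplyr and no longer carries an Arrow sort key. It gives up lazy evaluation for the aggregation step, so it is a workaround rather than a fix:
   
   ```r
   library(arrow)
   library(dplyr)
   
   # Assumption: collect() executes the sort and hands back a tibble,
   # so the summarize() below runs in dplyr, not in the Arrow query.
   res <- arrow_table(mtcars) |>
     arrange(mpg) |>
     collect() |>
     summarize(across(mpg, list(Min = min, Max = max)))
   res
   ```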
   
   <details>
   <summary> <code>> sessionInfo()</code> </summary>
   
   ```r
   R version 4.4.2 (2024-10-31)
   Platform: aarch64-apple-darwin20
   Running under: macOS Sequoia 15.2
   
   Matrix products: default
   BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
   LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
   
   locale:
   [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
   
   time zone: America/New_York
   tzcode source: internal
   
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base     
   
   other attached packages:
   [1] arrow_18.1.0.1 dplyr_1.1.4   
   
   loaded via a namespace (and not attached):
    [1] assertthat_0.2.1 utf8_1.2.4       R6_2.5.1         bit_4.5.0.1      tidyselect_1.2.1 magrittr_2.0.3   glue_1.8.0       tibble_3.2.1     pkgconfig_2.0.3  bit64_4.5.2     
   [11] generics_0.1.3   lifecycle_1.0.4  cli_3.6.3        fansi_1.0.6      vctrs_0.6.5      withr_3.0.2      compiler_4.4.2   purrr_1.0.2      pillar_1.9.0     rlang_1.1.4     
   ```
   
   </details>
   
   ### Component(s)
   
   R

