wklimowicz opened a new issue, #47169:
URL: https://github.com/apache/arrow/issues/47169
### Describe the bug, including details regarding any error messages, version, and platform.
When writing large list columns to parquet, arrow errors out with:
```
Error: Capacity error: List array cannot contain more than 2147483646 elements, have 1200
```
Reproducible example. The error occurs with both the CRAN `arrow` version
(20.0.0.2) and the current git version (21.0.0.9000).
```r
library(tibble)
library(arrow)
rows <- 2e6L
elements_each <- 1200L
tbl <- tibble(
  id = seq_len(rows),
  b = replicate(rows, list(seq_len(elements_each)), simplify = FALSE)
)
write_parquet(tbl, "big_list.parquet")
```
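For scale, here is a rough calculation of why this example hits the limit. It assumes the limit quoted in the error message applies to the total number of values stored in the list column across all rows, not to any single row; that reading is an assumption, not something confirmed here.

```r
rows <- 2e6
elements_each <- 1200

# Computed as doubles, because the product does not fit in a 32-bit integer
total_elements <- rows * elements_each

# The limit quoted in the error message
limit <- 2147483646

# TRUE: the column as a whole holds more values than the limit
total_elements > limit

# Under the same assumption, the most rows of this shape that fit in one array
max_rows_per_array <- floor(limit / elements_each)
```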
Actual behaviour: `Error: Capacity error: List array cannot contain more
than 2147483646 elements, have 1200`.
Expected behaviour: Automatically chunking behind the scenes, or a
suggestion of how the user should chunk manually.
I think this is a similar bug to #10776, but it happens when writing rather
than reading. I'm looking for clarity on whether this can be chunked
automatically, in the spirit of the
[vignette](https://arrow.apache.org/docs/r/articles/data_objects.html):
> An important thing to note is that “chunking” is not semantically
meaningful. It is an implementation detail only: users should never treat the
chunk as a meaningful unit.
Alternatively, a workaround would be good. I've tried a couple with
`write_dataset`, but I don't understand the internals well enough. Two things
that didn't work (same error):
```r
# Approach 1:
# Group by + write_dataset
tbl |>
  dplyr::group_by(id = id %% 10L) |> # Create many groups by ID
  write_dataset("big_list")

# Approach 2:
# max_rows...
tbl |>
  write_dataset(
    "big_list",
    max_rows_per_file = 5000L,
    max_rows_per_group = 5000L
  )
```
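One possible manual workaround, as a sketch only: slice the data frame in R before anything is handed to arrow, and write each slice to its own Parquet file. This assumes the error is raised when a whole R list column is converted into a single Arrow array, which is not confirmed here, and it has not been checked against the full-size `tbl`. `write_parquet_in_slices()` is a hypothetical helper, not part of arrow; the demo data and output path are stand-ins.

```r
library(arrow)

# Hypothetical helper (not part of arrow): convert and write one row slice
# at a time, so no single conversion sees the whole list column.
write_parquet_in_slices <- function(df, dir, rows_per_slice) {
  stopifnot(nrow(df) > 0, rows_per_slice >= 1)
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  starts <- seq(1, nrow(df), by = rows_per_slice)
  for (i in seq_along(starts)) {
    # Row indices for this slice; the last slice may be shorter
    idx <- seq(starts[i], min(starts[i] + rows_per_slice - 1, nrow(df)))
    path <- file.path(dir, sprintf("part-%04d.parquet", i))
    write_parquet(df[idx, , drop = FALSE], path)
  }
  invisible(dir)
}

# Small stand-in data with the same column layout as `tbl`, so the sketch runs quickly
demo <- data.frame(id = seq_len(10))
demo$b <- replicate(10, list(seq_len(3)), simplify = FALSE)

out_dir <- file.path(tempdir(), "big_list_demo")
write_parquet_in_slices(demo, out_dir, rows_per_slice = 4)

# The resulting directory can be opened as a single dataset
ds <- open_dataset(out_dir)
```

For the full-size data, `rows_per_slice` would need to be small enough that each slice's list column stays under the limit in the error message.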
<details>
<summary> session_info() </summary>
```
─ Session info ──────────────────────
setting value
version R version 4.5.0 (2025-04-11)
os Fedora Linux 42 (Workstation Edition)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_GB.UTF-8
ctype en_GB.UTF-8
tz Europe/London
date 2025-07-22
pandoc 3.1.11.1 @ /usr/bin/pandoc
quarto 99.9.9 @ /home/wojtek/.local/bin/quarto
─ Packages ───────────────────────────
package * version date (UTC) lib source
arrow * 21.0.0.9000 2025-07-22 [1] local
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.5.0)
bit 4.6.0 2025-03-06 [1] CRAN (R 4.5.0)
bit64 4.6.0-1 2025-01-16 [1] CRAN (R 4.5.0)
cli 3.6.5 2025-04-23 [1] CRAN (R 4.5.0)
glue 1.8.0 2024-09-30 [1] CRAN (R 4.5.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.5.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.5.0)
pillar 1.11.0 2025-07-04 [1] CRAN (R 4.5.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.5.0)
purrr 1.1.0 2025-07-10 [1] CRAN (R 4.5.0)
R6 2.6.1 2025-02-15 [1] CRAN (R 4.5.0)
rlang 1.1.6 2025-04-11 [1] CRAN (R 4.5.0)
sessioninfo 1.2.3 2025-02-05 [1] CRAN (R 4.5.0)
tibble * 3.3.0 2025-06-08 [1] CRAN (R 4.5.0)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.5.0)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.5.0)
[1] /home/wojtek/.local/share/R/x86_64-pc-linux-gnu-library/4.5
[2] /opt/R/4.5.0/lib64/R/library
* ── Packages attached to the search path.
```
</details>
### Component(s)
C++, R