martinstuder opened a new issue, #45765: URL: https://github.com/apache/arrow/issues/45765
### Describe the bug, including details regarding any error messages, version, and platform.

**Environment**
* Windows 11 (build 22631.4890)
* pyarrow 19.0.0
* Python 3.11.10

**Issue Description**
Data in nested list columns can be corrupted when writing partitioned parquet datasets, seemingly depending on how many rows the dataset has.

**Reproducible example**
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

filename = 'repro.parquet'
n_rows = 11_000  # works with 10_000 rows
n_partitions = 7
n_nested_size = 100_000

lst = list(range(n_nested_size))
r = range(n_rows)

table = pa.table({
    'id': list(r),
    'partition': [i % n_partitions for i in r],
    'nested': [np.random.permutation(lst) for _ in r],
})

pq.write_to_dataset(table, filename, partition_cols=['partition'])  # works without partitioning

df = pq.read_pandas(filename)
for i in r:
    assert len(set(df['nested'][i])) == n_nested_size  # assertion error is triggered
```

**Findings**
- The issue does not appear with only 10,000 rows in the example above.
- The issue does not appear without partitioning (a control sketch follows below).
- The issue does not appear with pyarrow 10.0.1.

### Component(s)

Python
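For comparison, here is a minimal sketch of the non-partitioned control mentioned in the findings. It assumes the variables (`table`, `r`, `n_nested_size`) from the reproducer above are still in scope; the output file name `repro_unpartitioned.parquet` is just illustrative.

```python
import pyarrow.parquet as pq

# Control case: write the same table as a single Parquet file, without partition_cols.
pq.write_table(table, 'repro_unpartitioned.parquet')

# Read it back and repeat the same integrity check on the nested list column.
t2 = pq.read_pandas('repro_unpartitioned.parquet')
for i in r:
    # .as_py() converts the list scalar to a plain Python list before deduplicating.
    assert len(set(t2['nested'][i].as_py())) == n_nested_size  # passes without partitioning
```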