martinstuder opened a new issue, #45765:
URL: https://github.com/apache/arrow/issues/45765
### Describe the bug, including details regarding any error messages, version, and platform.
**Environment**
* Windows 11 (build 22631.4890)
* pyarrow 19.0.0
* Python 3.11.10
**Issue Description**
Data in nested list columns can be corrupted when writing partitioned Parquet
datasets; whether corruption occurs appears to depend on the number of rows in
the dataset.
**Reproducible example**
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

filename = 'repro.parquet'
n_rows = 11_000  # works with 10_000 rows
n_partitions = 7
n_nested_size = 100_000

lst = list(range(n_nested_size))
r = range(n_rows)
table = pa.table({
    'id': list(r),
    'partition': [i % n_partitions for i in r],
    'nested': [np.random.permutation(lst) for _ in r],
})

pq.write_to_dataset(table, filename, partition_cols=['partition'])  # works without partitioning

df = pq.read_pandas(filename)
for i in r:
    assert len(set(df['nested'][i])) == n_nested_size  # assertion error is triggered
```
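For completeness, here is a small verification sketch (not part of the original repro; it assumes the `filename` and `n_nested_size` variables from the snippet above are still in scope) that reads the dataset back with plain pyarrow and counts how many rows lost values:

```python
import pyarrow.parquet as pq

# Read the partitioned dataset back and restore the original row order by id.
read_back = pq.read_table(filename).sort_by('id')

# A row is corrupted if its nested list no longer contains all distinct values.
corrupted = [
    i for i, row in enumerate(read_back['nested'].to_pylist())
    if len(set(row)) != n_nested_size
]
print(f'{len(corrupted)} of {read_back.num_rows} rows corrupted')
```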
**Findings**
- The issue does not occur with only 10,000 rows in the example above
- The issue does not occur without partitioning (a manual-partitioning workaround is sketched below)
- The issue does not occur with pyarrow 10.0.1
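Since the corruption only shows up when `write_to_dataset` handles the partitioning, one possible workaround (a sketch only, not verified against this bug; the loop, file layout, and file names are illustrative) is to split the table manually and write each partition with `pq.write_table`:

```python
import os
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Write one file per partition value instead of using write_to_dataset.
for part in table['partition'].unique().to_pylist():
    subset = table.filter(pc.equal(table['partition'], part))
    part_dir = os.path.join(filename, f'partition={part}')
    os.makedirs(part_dir, exist_ok=True)
    pq.write_table(
        subset.drop_columns(['partition']),  # the partition value is encoded in the path
        os.path.join(part_dir, 'part-0.parquet'),
    )
```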
### Component(s)
Python