martinstuder opened a new issue, #45765:
URL: https://github.com/apache/arrow/issues/45765

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   **Environment**
   * Windows 11 (build 22631.4890)
   * pyarrow 19.0.0
   * Python 3.11.10
   
   **Issue Description**
   Data in nested list columns can be corrupted when writing partitioned Parquet datasets; whether the corruption occurs seems to depend on the number of rows in the dataset.
   
   **Reproducible example**
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   filename = 'repro.parquet'
   
   n_rows = 11_000  # works with 10_000 rows
   n_partitions = 7
   n_nested_size = 100_000
   
   lst = list(range(n_nested_size))
   r = range(n_rows)
   
   table = pa.table({
       'id': list(r),
       'partition': [i % n_partitions for i in r],
       'nested': [np.random.permutation(lst) for _ in r],
   })
   
   pq.write_to_dataset(table, filename, partition_cols=['partition'])  # works without partitioning
   
   df = pq.read_pandas(filename)
   for i in r:
       assert len(set(df['nested'][i])) == n_nested_size  # assertion error is triggered
   ```
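
   For completeness, a rough per-row check against the original table; since the partitioned read may not return rows in the original order, the comparison below is keyed on `id` rather than on position (an extra sketch, not required to reproduce the bug):

   ```python
   readback = pq.read_table(filename)
   ids = readback['id'].to_pylist()
   # Expected contents per id, taken from the in-memory source table
   expected = {i: set(table['nested'][i].as_py()) for i in r}
   for row, row_id in enumerate(ids):
       # Compare the round-tripped list for this id against the source table
       assert set(readback['nested'][row].as_py()) == expected[row_id]
   ```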
   
   **Findings**
   - The issue does not appear with only 10,000 rows in the example above
   - The issue does not appear without partitioning (see the workaround sketch below)
   - The issue does not appear with pyarrow 10.0.1
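
   Based on the second finding, a rough workaround sketch is to write each partition manually with `pq.write_table`, which goes through the non-partitioned code path. The Hive-style `partition=value` directory names and the `part-0.parquet` file name below are assumptions meant to mirror the default layout of `write_to_dataset`:

   ```python
   import os

   import pyarrow.compute as pc
   import pyarrow.parquet as pq

   def write_partitions_manually(table, root, partition_col):
       # Write one plain (non-partitioned) Parquet file per partition value,
       # mirroring the Hive-style directory layout of write_to_dataset.
       for value in pc.unique(table[partition_col]).to_pylist():
           part = table.filter(pc.equal(table[partition_col], value))
           part = part.drop_columns([partition_col])
           part_dir = os.path.join(root, f"{partition_col}={value}")
           os.makedirs(part_dir, exist_ok=True)
           pq.write_table(part, os.path.join(part_dir, "part-0.parquet"))

   # e.g. write_partitions_manually(table, 'repro_manual.parquet', 'partition')
   ```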
   
   ### Component(s)
   
   Python

