lhoestq opened a new issue, #45214:
URL: https://github.com/apache/arrow/issues/45214

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   In the `datasets` library we are using `ParquetFragment.to_batches()` to 
stream batches of data while applying filters.
   
   However, @AlexKoff88 reported in 
https://github.com/huggingface/datasets/issues/7357 that for some datasets, like 
[phiyodr/InpaintCOCO](https://huggingface.co/datasets/phiyodr/InpaintCOCO), it 
causes the code to hang.
   
   I managed to make a minimal reproducible example:
   
   ```
   wget https://huggingface.co/datasets/phiyodr/InpaintCOCO/resolve/c56e31947190173d2d6373c4833b0a9889ff6eee/data/test-00000-of-00003.parquet
   ```
   file info:
   - size: 300MB
   - 5 row groups of <100 rows
   - see all the parquet metadata 
[here](https://huggingface.co/datasets/phiyodr/InpaintCOCO/tree/c56e31947190173d2d6373c4833b0a9889ff6eee/data?show_file_info=data%2Ftest-00000-of-00003.parquet)
   - contains nested and binary types for images (not sure if relevant)
   
   ```python
   import pyarrow.dataset as ds
   
   file = "test-00000-of-00003.parquet"
   with open(file, "rb") as f:
       parquet_fragment = ds.ParquetFileFormat().make_fragment(f)
       for record_batch in parquet_fragment.to_batches():
           print(len(record_batch))  # 100
           break  # hangs forever
   ```
   
   Environment:
   - python 3.12.2
   - pyarrow 18.1.0
   - macbook pro m2
   
   The issue appears when running the code as a Python script, but doesn't 
appear in Google Colab or in IPython.
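
Since the hang only shows up in a plain script, one possible way to gather more detail is the standard-library `faulthandler` module. This is a sketch rather than a confirmed diagnosis: the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical, and the dump only covers Python-level frames, so a hang inside native Arrow threads may still need a native debugger.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=30.0, file=sys.stderr):
    """Dump the tracebacks of all Python threads to `file` if the process is
    still running after `timeout_s` seconds. The process is not terminated.

    `file` must stay open until the dump happens (or the watchdog is cancelled).
    """
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=file, exit=False
    )


def disarm_hang_watchdog():
    """Cancel a pending dump, e.g. once the work has finished normally."""
    faulthandler.cancel_dump_traceback_later()
```

In the reproducer, `arm_hang_watchdog()` could be called at the top of the script, before the `with open(...)` block, so that a dump is written if the script is still alive after the timeout.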
   
   The issue also appears in 
[eltorio/ROCOv2-radiology](https://huggingface.co/datasets/eltorio/ROCOv2-radiology)
 which happens to also contain binary types. The issue doesn't seem to appear 
in datasets like 
[AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) 
which don't contain binary types.
   
   
   ### Component(s)
   
   Parquet, Python

