lhoestq opened a new issue, #45214: URL: https://github.com/apache/arrow/issues/45214
### Describe the bug, including details regarding any error messages, version, and platform.

In the `datasets` library we are using `ParquetFragment.to_batches()` to stream batches of data while applying filters. However @AlexKoff88 reported that for some datasets like [phiyodr/InpaintCOCO](https://huggingface.co/datasets/phiyodr/InpaintCOCO) it causes the code to hang at https://github.com/huggingface/datasets/issues/7357.

I managed to make a minimal reproducible example:

```
wget https://huggingface.co/datasets/phiyodr/InpaintCOCO/resolve/c56e31947190173d2d6373c4833b0a9889ff6eee/data/test-00000-of-00003.parquet
```

file info:
- size: 300MB
- 5 row groups of <100 rows
- see all the parquet metadata [here](https://huggingface.co/datasets/phiyodr/InpaintCOCO/tree/c56e31947190173d2d6373c4833b0a9889ff6eee/data?show_file_info=data%2Ftest-00000-of-00003.parquet)
- contains nested and binary types for images (not sure if relevant)

```python
import pyarrow.dataset as ds

file = "test-00000-of-00003.parquet"
with open(file, "rb") as f:
    parquet_fragment = ds.ParquetFileFormat().make_fragment(f)
    for record_batch in parquet_fragment.to_batches():
        print(len(record_batch))  # 100
        break  # hangs forever
```

Environment:
- python 3.12.2
- pyarrow 18.1.0
- macbook pro m2

The issue appears when running the python script, but doesn't appear in google colab or in ipython.

The issue also appears in [eltorio/ROCOv2-radiology](https://huggingface.co/datasets/eltorio/ROCOv2-radiology) which happens to also contain binary types.

The issue doesn't seem to appear in datasets like [AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) which don't contain binary types.

### Component(s)

Parquet, Python
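To narrow down where a script like the reproducer is stuck, one option is to have Python dump every thread's stack if the work has not finished after a timeout. The sketch below uses only the standard library `faulthandler` module. The helper name `run_with_hang_dump`, the timeout values, and the stand-in workload are illustrative assumptions and not part of the original report; in practice the workload would be the `to_batches()` loop.

```python
import faulthandler
import sys
import time


def run_with_hang_dump(workload, timeout_s=30.0, file=sys.stderr):
    """Run workload(); if it is still running after timeout_s seconds,
    dump the Python traceback of every thread to `file`.

    Hypothetical helper for illustration only. `file` must be backed by
    a real file descriptor (it needs a working fileno()).
    """
    # Arm a watchdog: after timeout_s seconds faulthandler writes all
    # thread stacks to `file`; exit=False leaves the process running.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
    try:
        return workload()
    finally:
        # Disarm the watchdog once the workload has returned or raised.
        faulthandler.cancel_dump_traceback_later()


def stand_in_workload():
    # Placeholder for the reproducer's to_batches() loop (assumption).
    time.sleep(0.01)
    return "done"
```

Calling `run_with_hang_dump(stand_in_workload, timeout_s=5.0)` returns `"done"` and writes nothing, because the workload finishes before the timeout. This only covers hangs that happen inside the wrapped call, and frames executing native code show up only as the Python line that called into them.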