bveeramani opened a new issue, #45439:
URL: https://github.com/apache/arrow/issues/45439
### Describe the bug, including details regarding any error messages, version, and platform.

## Problem

I'm trying to read batches of a large Parquet file, but I'm encountering OOMs. It seems like the memory retained by `to_batches` monotonically increases with each output batch, so the whole file eventually ends up in memory.

## Repro

Code to create `large_dataset.parquet` is here: https://gist.github.com/bveeramani/5b43b3e57fa3f68cf2ac1d724bca99bf#file-3-py.

```python
import os

import psutil
import pyarrow.parquet as pq

process = psutil.Process(os.getpid())

dataset = pq.ParquetDataset("large_dataset.parquet")
fragment = dataset.fragments[0]
for i, batch in enumerate(fragment.to_batches()):
    # RSS grows from 1.4 GiB at the first batch to 19.2 GiB at the last
    print(i, process.memory_info().rss)
```

### Component(s)

Python
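For comparison, here is a minimal sketch (not from the original report) that streams batches through `pyarrow.parquet.ParquetFile.iter_batches` instead of the `ParquetDataset`/fragment path. Whether this path avoids the monotonic RSS growth is an assumption to verify, not a confirmed workaround; it assumes the same `large_dataset.parquet` from the gist above.

```python
import os

import psutil
import pyarrow.parquet as pq

process = psutil.Process(os.getpid())

# Stream record batches directly from the file rather than via
# ParquetDataset(...).fragments[0].to_batches(). It is an assumption,
# not a confirmed fix, that this avoids retaining earlier batches.
parquet_file = pq.ParquetFile("large_dataset.parquet")
for i, batch in enumerate(parquet_file.iter_batches()):
    # If the growth is specific to the fragment path, RSS should stay
    # roughly flat here instead of climbing toward the full file size.
    print(i, process.memory_info().rss)
```

If this variant shows flat memory usage, that would help narrow the leak to the dataset/fragment code path rather than the Parquet reader itself.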