bveeramani opened a new issue, #45439:
URL: https://github.com/apache/arrow/issues/45439

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ## Problem
   
   I'm trying to read batches of a large Parquet file, but I'm hitting OOMs. The process's memory use grows monotonically with each batch that `to_batches` yields, until effectively the whole file is resident in memory.
   
   ## Repro
   
   Code to create `large_dataset.parquet` is here: https://gist.github.com/bveeramani/5b43b3e57fa3f68cf2ac1d724bca99bf#file-3-py.
   
   ```python
   import os
   
   import psutil
   import pyarrow.parquet as pq
   
   process = psutil.Process(os.getpid())
   dataset = pq.ParquetDataset("large_dataset.parquet")
   fragment = dataset.fragments[0]
   for i, batch in enumerate(fragment.to_batches()):
       # RSS goes from 1.4 GiB on the first output to 19.2 GiB on the last
       print(i, process.memory_info().rss)
   ```
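   
   One way to narrow this down (a sketch on my part, not something I've confirmed): compare Arrow's own pool accounting against RSS. `pa.total_allocated_bytes()` reports memory held by Arrow's default memory pool, so if it stays flat while RSS climbs, the growth would point at the allocator caching freed pages rather than Arrow retaining the batches.
   
   ```python
   import os
   
   import psutil
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   process = psutil.Process(os.getpid())
   dataset = pq.ParquetDataset("large_dataset.parquet")
   fragment = dataset.fragments[0]
   for i, batch in enumerate(fragment.to_batches()):
       # If total_allocated_bytes() stays flat while RSS climbs, the growth
       # is likely allocator-level caching; if both climb together, Arrow
       # itself is holding on to the data.
       print(i, pa.total_allocated_bytes(), process.memory_info().rss)
   ```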
   
   
   
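   As a possible workaround in the meantime (an assumption on my part that this path keeps memory bounded, since `ParquetFile.iter_batches` streams the file directly rather than going through a dataset fragment):
   
   ```python
   import os
   
   import psutil
   import pyarrow.parquet as pq
   
   process = psutil.Process(os.getpid())
   # Workaround sketch: stream batches straight from the file with
   # ParquetFile.iter_batches instead of Fragment.to_batches.
   # batch_size is illustrative; tune it to the schema and memory budget.
   pf = pq.ParquetFile("large_dataset.parquet")
   for i, batch in enumerate(pf.iter_batches(batch_size=64_000)):
       print(i, process.memory_info().rss)
   ```
   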
   ### Component(s)
   
   Python

