Tom-Newton opened a new issue, #47266:
URL: https://github.com/apache/arrow/issues/47266

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Since 21.0.0, memory usage grows substantially when repeatedly reading a Parquet dataset from local disk. With version 20.0.0, memory usage increased much less.
   
   ### Script to reproduce ###
   ```
   import tempfile
   import numpy
   import pyarrow
   import pyarrow.dataset
   import pyarrow.parquet
   from memory_profiler import profile
   
   def test_memory_leak():
       num_columns = 10
       num_rows = 5_000_000
   
       data = {f"col_{i}": numpy.random.rand(num_rows) for i in range(num_columns)}
       table = pyarrow.Table.from_pydict(data)
   
       with tempfile.TemporaryDirectory() as temp_dir:
           pyarrow.dataset.write_dataset(table, temp_dir, format="parquet")
   
           @profile
           def read():
               return pyarrow.dataset.dataset(temp_dir).to_table()
   
           for _ in range(50):
               read()
   
   if __name__ == "__main__":
       test_memory_leak()
   ```
   
   ### Environment ###
   Ubuntu 24.04.2 LTS
   Tested python 3.10.15 and python 3.12.3
   
   Python packages:
   ```
   $ pip freeze
   memory-profiler==0.61.0
   numpy==2.3.2
   psutil==7.0.0
   pyarrow==21.0.0
   ```
   
   When using `pyarrow==21.0.0`, memory usage increases with each iteration: after the first read it is at about 1.5GiB, and after the 50th read it is at about 20GiB. If I run the same test with `pyarrow==20.0.0`, memory usage still increases slightly with the iterations, but it is still less than 2GiB after the 50th iteration.
   
   
   ### Debugging ###
   I ran a git bisect and identified https://github.com/apache/arrow/pull/45979 
as the change point. Building from the 21.0.0 release commit with 
`ARROW_MIMALLOC=OFF` also solves the problem. 
   
   
   ### Component(s)
   
   C++, Python

