Tom-Newton opened a new issue, #47266:
URL: https://github.com/apache/arrow/issues/47266
### Describe the bug, including details regarding any error messages, version, and platform.
Since 21.0.0, memory usage grows substantially when repeatedly reading a Parquet dataset from local disk. With version 20.0.0 the memory usage increased much less.
### Script to reproduce

```
import tempfile

import numpy
import pyarrow
import pyarrow.dataset
import pyarrow.parquet
from memory_profiler import profile


def test_memory_leak():
    # Build a ~400 MB table of random float64 columns.
    num_columns = 10
    num_rows = 5_000_000
    data = {f"col_{i}": numpy.random.rand(num_rows) for i in range(num_columns)}
    table = pyarrow.Table.from_pydict(data)

    with tempfile.TemporaryDirectory() as temp_dir:
        pyarrow.dataset.write_dataset(table, temp_dir, format="parquet")

        @profile
        def read():
            return pyarrow.dataset.dataset(temp_dir).to_table()

        # Read the same dataset 50 times and watch memory usage grow.
        for _ in range(50):
            read()


if __name__ == "__main__":
    test_memory_leak()
```
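The same measurement can be taken without memory_profiler by sampling the process RSS with psutil, which is already installed in the environment below. A minimal sketch, where `rss_gib` is a helper introduced here for illustration:

```
import os

import psutil


def rss_gib() -> float:
    """Resident set size of the current process, in GiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 2**30

# Inside the repro's loop, instead of the bare read() call:
#     for i in range(50):
#         read()
#         print(f"iteration {i + 1}: RSS = {rss_gib():.2f} GiB")
```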
### Environment
Ubuntu 24.04.2 LTS
Tested with Python 3.10.15 and Python 3.12.3.
Python packages:
```
$ pip freeze
memory-profiler==0.61.0
numpy==2.3.2
psutil==7.0.0
pyarrow==21.0.0
```
When using `pyarrow==21.0.0` the memory usage increases with each iteration: after the first read it is at about 1.5 GiB, and after the 50th read it is at about 20 GiB. If I run the same test with `pyarrow==20.0.0`, the memory usage still increases slightly with the iterations, but it stays under 2 GiB after the 50th iteration.
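It may also be worth checking whether the retained bytes are tracked by Arrow's memory pool or only show up as process RSS (the latter would point at the allocator holding on to freed pages). A sketch using pyarrow's pool-introspection APIs, to be called after each `read()`:

```
import pyarrow

pool = pyarrow.default_memory_pool()
print(f"backend:          {pool.backend_name}")  # which allocator is in use
print(f"bytes_allocated:  {pool.bytes_allocated() / 2**30:.2f} GiB")
print(f"max_memory:       {pool.max_memory() / 2**30:.2f} GiB")
print(f"total, all pools: {pyarrow.total_allocated_bytes() / 2**30:.2f} GiB")
```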
### Debugging
I ran a git bisect and identified https://github.com/apache/arrow/pull/45979
as the change point. Building from the 21.0.0 release commit with
`ARROW_MIMALLOC=OFF` also solves the problem.
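If mimalloc is indeed the culprit, a rebuild shouldn't be needed to confirm or work around it, since pyarrow lets the default pool be chosen at runtime. A sketch (untested against this repro):

```
import pyarrow

# Route Arrow allocations through the system allocator instead of mimalloc.
# Equivalently, export ARROW_DEFAULT_MEMORY_POOL=system before importing
# pyarrow. This must run before any Arrow allocations take place.
pyarrow.set_memory_pool(pyarrow.system_memory_pool())
```

Separately, calling `pyarrow.default_memory_pool().release_unused()` between reads is a best-effort way to ask the allocator to return freed pages to the OS; if that flattens the curve, the pages are being cached rather than leaked.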
### Component(s)
C++, Python