yum-yab opened a new issue, #49474: URL: https://github.com/apache/arrow/issues/49474
### Describe the bug, including details regarding any error messages, version, and platform.

I use pyarrow to load and filter batches of a large hive-partitioned Parquet dataset on an HPC cluster. Because of the memory limits the cluster imposes, my jobs kept getting OOM-killed. When I started investigating, the process kept accumulating RAM, *no matter the memory pool type*. After I finally switched to the system memory pool, it looks like a memory leak: RSS keeps growing (blue line in the graphic) even though pyarrow reports comparatively little allocated memory (orange line in the graphic):

<img width="1800" height="600" alt="Image" src="https://github.com/user-attachments/assets/eb714ee1-ed76-45ae-9e99-e07ebb28a3c6" />

This makes it very hard to use pyarrow in memory-constrained environments like an HPC cluster.

My questions are:

1. Did I make any obvious mistake in using pyarrow?
2. Can this be prevented so that HPC jobs run stably?

Code to reproduce the issue (and generate the diagram): [minimal_working_example.py](https://github.com/user-attachments/files/25839547/minimal_working_example.py)

(DISCLAIMER: I wrote this code to be similar to my actual use case. The issue probably still persists when stripping away things like the filters and the other columns, but it is easy to reproduce with the code provided.)
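For readers who do not want to open the attachment, the comparison between RSS and pyarrow's own accounting can be sketched roughly as follows. This is not the attached script, only a minimal illustration under stated assumptions: it reads RSS from `/proc/self/status` (Linux only), and the helper names `parse_vmrss_kib`, `current_rss_bytes`, and `report` are hypothetical. The pyarrow calls used (`pa.default_memory_pool()`, `backend_name`, `bytes_allocated()`) are part of the public API.

```python
import re
from typing import Optional


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in KiB) from /proc/<pid>/status text.

    Returns None if no VmRSS line is present.
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def current_rss_bytes() -> Optional[int]:
    """Resident set size of this process in bytes (Linux only, else None)."""
    try:
        with open("/proc/self/status") as fh:
            kib = parse_vmrss_kib(fh.read())
    except OSError:
        return None
    return None if kib is None else kib * 1024


def report(label: str) -> None:
    """Print OS-level RSS next to the bytes pyarrow's default pool reports."""
    rss = current_rss_bytes()
    try:
        import pyarrow as pa  # imported lazily so the helpers work without it
    except ImportError:
        print(f"{label}: rss={rss} (pyarrow not installed)")
        return
    pool = pa.default_memory_pool()
    print(
        f"{label}: backend={pool.backend_name} "
        f"rss={rss} arrow_allocated={pool.bytes_allocated()}"
    )
```

Calling `report("batch N")` after each processed batch, optionally after selecting a pool with `pa.set_memory_pool(pa.system_memory_pool())`, gives the two series that the graphic compares: a growing `rss` alongside a roughly flat `arrow_allocated` would match the behavior described above.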
System: PyArrow and system versions used:

```
PyArrow : 23.0.1
Python : 3.13.5
OS : Linux-6.12.73-1-MANJARO-x86_64-with-glibc2.43
BuildInfo(build_type='release', cpp_build_info=CppBuildInfo(version='23.0.1', version_info=VersionInfo(major=23, minor=0, patch=1), so_version='2300', full_so_version='2300.1.0', compiler_id='GNU', compiler_version='14.2.1', compiler_flags=' -Wno-noexcept-type -Wno-self-move -Wno-subobject-linkage -fdiagnostics-color=always -Wall -fno-semantic-interposition -msse4.2 ', git_id='', git_description='', package_kind='python-wheel-manylinux228', build_type='release'))
```

### Component(s)

Python, Parquet
