yum-yab opened a new issue, #49474:
URL: https://github.com/apache/arrow/issues/49474

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I use pyarrow to load and filter batches of a large hive-partitioned parquet 
dataset on an HPC cluster.
   
   Because of the memory limits imposed by the cluster, my jobs kept getting 
OOM-killed. When I started investigating, pyarrow kept accumulating RAM, *no 
matter the memory pool type*.
   
   When I finally switched to the system memory pool, it looked like a memory 
leak: RSS memory keeps growing (blue line in the graph below) even though 
pyarrow reports far less allocated memory (orange line in the graph below).
   
   This makes it very hard to use pyarrow in memory-constrained environments 
like an HPC cluster.
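   
   A minimal sketch of how the two quantities can be sampled side by side 
(this is not the attached script). It assumes Linux, since it reads 
`/proc/self/statm`; the helper names `rss_bytes`, `arrow_allocated_bytes`, and 
`sample` are hypothetical, and pyarrow is imported optionally so the snippet 
also runs where it is not installed:
   
```python
import os

try:
    import pyarrow as pa  # optional here so the sketch runs without pyarrow
except ImportError:
    pa = None


def rss_bytes():
    """Resident set size of this process, read from /proc (Linux only)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the number of resident pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def arrow_allocated_bytes():
    """Bytes currently allocated through Arrow's default memory pool."""
    if pa is None:
        return 0  # assumption: report 0 when pyarrow is unavailable
    return pa.total_allocated_bytes()


def sample():
    """Return both numbers so they can be logged after each batch."""
    return {"rss": rss_bytes(), "arrow": arrow_allocated_bytes()}
```
   
   Calling `sample()` after each batch and plotting both values gives a 
comparison of process RSS against what pyarrow itself reports.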
   
   My questions are:
   
   1. Did I make any obvious mistake in using pyarrow?
   2. Can this be prevented so that HPC jobs run stably?
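   
   Regarding question 2, one thing that may be worth trying is to explicitly 
ask the allocator to hand free memory back to the OS between batches. A sketch, 
assuming glibc; `release_unused_memory` is a hypothetical helper name, and 
whether it reduces RSS in this case is untested:
   
```python
import ctypes

try:
    import pyarrow as pa  # optional so the sketch runs without pyarrow
except ImportError:
    pa = None


def release_unused_memory():
    """Ask Arrow's pool and glibc to return free memory to the OS.

    Returns malloc_trim's result (1 if memory was released, 0 if not),
    or None when glibc's malloc_trim is not available.
    """
    if pa is not None:
        # Best-effort hint to the default pool to release unused memory.
        pa.default_memory_pool().release_unused()
    try:
        libc = ctypes.CDLL("libc.so.6")  # assumption: glibc is present
        return libc.malloc_trim(0)
    except (OSError, AttributeError):
        return None
```
   
   If RSS drops after such a call, the growth is more likely memory retained 
by the allocator than a true leak.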
   
   <img width="1800" height="600" alt="Image" 
src="https://github.com/user-attachments/assets/eb714ee1-ed76-45ae-9e99-e07ebb28a3c6"
 />
   
   Code used to reproduce the issue (and generate the diagram):
   
   
[minimal_working_example.py](https://github.com/user-attachments/files/25839547/minimal_working_example.py)
   
   (DISCLAIMER: I wrote this code to resemble my actual use case; the issue 
probably still persists when stripping away things like the filters and other 
columns. It is still easy to reproduce the issue with the code provided.)
   
   PyArrow and system versions used:
   
   ```
   PyArrow : 23.0.1
   Python  : 3.13.5
   OS      : Linux-6.12.73-1-MANJARO-x86_64-with-glibc2.43
   BuildInfo(build_type='release', 
cpp_build_info=CppBuildInfo(version='23.0.1', 
version_info=VersionInfo(major=23, minor=0, patch=1), so_version='2300', 
full_so_version='2300.1.0', compiler_id='GNU', compiler_version='14.2.1', 
compiler_flags=' -Wno-noexcept-type -Wno-self-move -Wno-subobject-linkage  
-fdiagnostics-color=always  -Wall -fno-semantic-interposition -msse4.2 ', 
git_id='', git_description='', package_kind='python-wheel-manylinux228', 
build_type='release'))
   ```
   
   
   ### Component(s)
   
   Python, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
