twkim112 opened a new issue, #45504: URL: https://github.com/apache/arrow/issues/45504
### Describe the bug, including details regarding any error messages, version, and platform.

I’ve encountered a memory issue when reading Parquet files with Pandas using the pyarrow engine. Even though pyarrow.total_allocated_bytes() reports that allocated memory goes back to zero after each function call, the overall process memory (as reported by psutil) keeps increasing significantly over repeated calls.

Steps to Reproduce:

```python
import psutil
import time
import pandas as pd
import gc
import pyarrow as pa

# pa.jemalloc_set_decay_ms(0)


def print_memory_usage():
    process = psutil.Process()
    mem_info = process.memory_info()
    print(f"PA allocated_bytes after function call: {pa.total_allocated_bytes() / 1024 / 1024:.2f} MB")
    print(f"Memory Usage: {mem_info.rss / 1024 / 1024:.2f} MB")


def mem_and_time(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        # Print results
        print_memory_usage()
        print(f"Execution Time: {end_time - start_time:.6f} seconds")
        return result
    return wrapper


@mem_and_time
def test_func_pandas():
    # When using fastparquet, the memory usage is stable:
    # df = pd.read_parquet("/Users/test.parquet", engine='fastparquet')
    df = pd.read_parquet("/Users/test.parquet", engine='pyarrow')
    print(f"PA allocated_bytes inside function call: {pa.total_allocated_bytes() / 1024 / 1024:.2f} MB")
    return None


if __name__ == "__main__":
    for _ in range(10000):
        test_func_pandas()
```

Observe that:

- Inside each function call, pyarrow.total_allocated_bytes() reports a large allocation (e.g., ~2646 MB).
- After the function call, pyarrow.total_allocated_bytes() resets to 0 MB.
- However, the overall process memory usage (as shown by psutil) increases with each iteration.
```
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 3147.12 MB
Execution Time: 0.669164 seconds
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 3945.00 MB
Execution Time: 0.623360 seconds
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 4494.80 MB
Execution Time: 0.681895 seconds
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 4865.27 MB
Execution Time: 0.641056 seconds
...
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 6157.64 MB
Execution Time: 0.659480 seconds
```

> Environment:
>
> Python: 3.13.1
> Pandas: 2.2.3
> PyArrow: 19.0.0
> OS: macOS Sequoia 15.3.1

### Component(s)

Python
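To put numbers on the growth, the RSS readings in the log above can be differenced per call. The following is a minimal sketch using only the standard library. The helper names `rss_readings` and `growth_per_call` are hypothetical, and the sample text is copied from the first four consecutive iterations of the reported output; the final reading after the `...` is left out because the number of elided iterations is unknown.

```python
import re

# First four consecutive iterations, copied from the output reported above.
LOG = """\
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 3147.12 MB
Execution Time: 0.669164 seconds
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 3945.00 MB
Execution Time: 0.623360 seconds
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 4494.80 MB
Execution Time: 0.681895 seconds
PA allocated_bytes inside function call: 2646.27 MB
PA allocated_bytes after function call: 0.00 MB
Memory Usage: 4865.27 MB
Execution Time: 0.641056 seconds
"""


def rss_readings(log: str) -> list[float]:
    """Extract the 'Memory Usage' (RSS) values, in MB, in order of appearance."""
    pattern = r"^Memory Usage: ([\d.]+) MB$"
    return [float(value) for value in re.findall(pattern, log, flags=re.MULTILINE)]


def growth_per_call(readings: list[float]) -> list[float]:
    """Return the RSS increase (MB) between each pair of consecutive readings."""
    return [round(after - before, 2) for before, after in zip(readings, readings[1:])]


if __name__ == "__main__":
    readings = rss_readings(LOG)
    for delta in growth_per_call(readings):
        print(f"RSS grew by {delta:.2f} MB")
```

For the four readings shown, this gives increases of 797.88 MB, 549.80 MB, and 370.47 MB, so RSS keeps rising while the reported Arrow allocation returns to 0 MB after every call.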