twkim112 opened a new issue, #45504:
URL: https://github.com/apache/arrow/issues/45504

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I’ve encountered a memory issue when reading Parquet files with Pandas using 
the pyarrow engine. Even though pyarrow.total_allocated_bytes() reports that 
allocated memory goes back to zero after each function call, the overall 
process memory (as reported by psutil) keeps increasing significantly over 
repeated calls.
   
   Steps to Reproduce:
   ``` python
   import psutil
   import time
   import pandas as pd
   import gc
   import pyarrow as pa
   
   # pa.jemalloc_set_decay_ms(0)
   
   def print_memory_usage():
       process = psutil.Process()
       mem_info = process.memory_info()
       print(f"PA allocated_bytes after function call: {pa.total_allocated_bytes() / 1024 / 1024:.2f} MB")
       print(f"Memory Usage: {mem_info.rss / 1024 / 1024:.2f} MB")
   
   def mem_and_time(func):
       def wrapper(*args, **kwargs):
           start_time = time.time()
           result = func(*args, **kwargs)
           end_time = time.time()
   
           # Print results
           print_memory_usage()
           print(f"Execution Time: {end_time - start_time:.6f} seconds")
           return result
       return wrapper
   
   @mem_and_time
   def test_func_pandas():
       # When using fastparquet, the memory usage is stable:
       # df = pd.read_parquet("/Users/test.parquet", engine='fastparquet')
       df = pd.read_parquet("/Users/test.parquet", engine='pyarrow')
       print(f"PA allocated_bytes inside function call: {pa.total_allocated_bytes() / 1024 / 1024:.2f} MB")
       return None
   
   if __name__ == "__main__":
       for _ in range(10000):
           test_func_pandas()
   
   ```
   
   Observe that:
   
   - Inside each function call, pyarrow.total_allocated_bytes() reports a large allocation (e.g., ~2646 MB).
   - After the function call, pyarrow.total_allocated_bytes() resets to 0 MB.
   - However, the overall process memory usage (as shown by psutil) increases with each iteration.
   
   ```
   PA allocated_bytes inside function call: 2646.27 MB
   PA allocated_bytes after function call: 0.00 MB
   Memory Usage: 3147.12 MB
   Execution Time: 0.669164 seconds
   PA allocated_bytes inside function call: 2646.27 MB
   PA allocated_bytes after function call: 0.00 MB
   Memory Usage: 3945.00 MB
   Execution Time: 0.623360 seconds
   PA allocated_bytes inside function call: 2646.27 MB
   PA allocated_bytes after function call: 0.00 MB
   Memory Usage: 4494.80 MB
   Execution Time: 0.681895 seconds
   PA allocated_bytes inside function call: 2646.27 MB
   PA allocated_bytes after function call: 0.00 MB
   Memory Usage: 4865.27 MB
   Execution Time: 0.641056 seconds
   
   ...
   
   PA allocated_bytes inside function call: 2646.27 MB
   PA allocated_bytes after function call: 0.00 MB
   Memory Usage: 6157.64 MB
   Execution Time: 0.659480 seconds
   ```
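
To put numbers on the growth, here is a small sketch that extracts the `Memory Usage` readings from the first four consecutive iterations above and computes the increase between them. The helper names `rss_values` and `rss_deltas` are made up for illustration. The iterations hidden behind the `...` are left out, so this says nothing about whether the usage eventually plateaus.

```python
import re

# "Memory Usage" lines copied from the first four consecutive iterations
# of the output above. Later iterations are elided in the report, so they
# are deliberately not included here.
LOG = """\
Memory Usage: 3147.12 MB
Execution Time: 0.669164 seconds
Memory Usage: 3945.00 MB
Execution Time: 0.623360 seconds
Memory Usage: 4494.80 MB
Execution Time: 0.681895 seconds
Memory Usage: 4865.27 MB
Execution Time: 0.641056 seconds
"""

RSS_LINE = re.compile(r"^Memory Usage: ([0-9.]+) MB$", re.MULTILINE)


def rss_values(log):
    """Return the RSS readings (in MB) found in the log text, in order."""
    return [float(value) for value in RSS_LINE.findall(log)]


def rss_deltas(values):
    """Return the increase (in MB) between each pair of consecutive readings."""
    return [round(later - earlier, 2) for earlier, later in zip(values, values[1:])]


values = rss_values(LOG)
deltas = rss_deltas(values)
print("RSS readings (MB):", values)
print("Growth per iteration (MB):", deltas)
```

For these four readings the per-iteration growth is roughly 798 MB, 550 MB, and 370 MB, so the process keeps growing while the increments shrink.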
   
   > Environment:
   > 
   > Python: 3.13.1
   > Pandas: 2.2.3
   > PyArrow: 19.0.0
   > OS: macOS Sequoia 15.3.1
   
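A possible follow-up check, sketched here and not yet run on the setup above: report which allocator backend the default Arrow memory pool uses, then ask the pool to hand unused memory back to the OS and compare RSS before and after. The helpers `pool_report` and `rss_around_release` are hypothetical names; they only rely on `backend_name`, `bytes_allocated()`, and `release_unused()` from `pyarrow.MemoryPool`. Whether `release_unused()` changes RSS at all likely depends on the allocator backend, so treat the result as a hint, not proof.

```python
MIB = 1024 * 1024


def pool_report(pool):
    """Summarize a memory pool that exposes the pyarrow.MemoryPool interface."""
    return {
        "backend": pool.backend_name,
        "allocated_mb": pool.bytes_allocated() / MIB,
    }


def rss_around_release(pool, read_rss_bytes):
    """Return (rss_before_mb, rss_after_mb) around a pool.release_unused() call.

    read_rss_bytes is any zero-argument callable returning the process RSS
    in bytes, for example a wrapper around psutil.
    """
    before = read_rss_bytes()
    pool.release_unused()  # ask the allocator to return unused memory to the OS
    after = read_rss_bytes()
    return before / MIB, after / MIB


# Usage with the real libraries (assumes pyarrow and psutil are installed):
#
#   import psutil
#   import pyarrow as pa
#
#   process = psutil.Process()
#   pool = pa.default_memory_pool()
#   print(pool_report(pool))
#   print(rss_around_release(pool, lambda: process.memory_info().rss))
```

Calling these after each `test_func_pandas()` iteration would show whether the memory that Arrow reports as freed is being retained by the allocator or is held somewhere else in the process.
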
   ### Component(s)
   
   Python


