Voltagabbana opened a new issue, #44472:
URL: https://github.com/apache/arrow/issues/44472

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   **Description**
   
   We've identified a memory leak when importing Parquet files into Pandas 
DataFrames using the PyArrow engine. The issue occurs specifically during the 
conversion from Arrow to Pandas objects, as memory is not released even after 
deleting the DataFrame and invoking garbage collection.
   
   **Key findings:**
   
   - **No leak with PyArrow alone:** When using PyArrow to read the Parquet file without converting to Pandas (i.e., no _.to_pandas()_), the memory leak does not occur (a minimal baseline sketch follows this list).
   - **Leak with _.to_pandas()_:** The memory leak appears during the 
conversion from Arrow to Pandas, suggesting the problem is tied to this process.
   - **No issue with Fastparquet or Polars:** Fastparquet and Polars (even with 
PyArrow) do not exhibit this memory issue, reinforcing that the problem is in 
Pandas’ handling of Arrow data.
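
   For reference, the snippet below is a minimal sketch of that PyArrow-only baseline. It assumes the `random_dataset.parquet` file produced by the dataset script below and uses `psutil` (not strictly required) to report process RSS; it is a comparison point, not part of the reproduction itself.

   ```python
   # pyarrow_only_baseline.py
   # Sketch: read the Parquet file with PyArrow only (no .to_pandas()).
   # In our tests the process memory stays stable across iterations.

   import gc

   import psutil
   import pyarrow.parquet as pq

   data_path = 'random_dataset.parquet'
   process = psutil.Process()

   for i in range(10):
       table = pq.read_table(data_path)  # Arrow Table only, no Pandas conversion
       del table
       gc.collect()
       print(f"Iteration {i}: RSS = {process.memory_info().rss / (1024 ** 3):.2f} GB")
   ```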
   
   **Reproduction Code**
   
   ```python
   
   # dataset_creation.py
   
   # just a fake dataset 
   
   import pandas as pd
   import numpy as np
   import random
   import string
   
   np.random.seed(42)
   random.seed(42)
   
   def random_string(length):
       letters = string.ascii_letters
       return ''.join(random.choice(letters) for _ in range(length))
   
   num_rows = 10**6  
   col_types = {
       'col1': lambda: random_string(10),   
       'col2': lambda: np.random.randint(0, 1000),  
       'col3': lambda: np.random.random(),   
       'col4': lambda: random_string(5), 
       'col5': lambda: np.random.randint(1000, 10000),  
       'col6': lambda: np.random.uniform(0, 100),     
       'col7': lambda: random_string(8),   
       'col8': lambda: np.random.random() * 1000,  
       'col9': lambda: np.random.randint(0, 2),  
       'col10': lambda: random_string(1000)    
   }
   
   data = {col: [func() for _ in range(num_rows)] for col, func in col_types.items()}
   df = pd.DataFrame(data)
   df.to_parquet('random_dataset.parquet', index=True)
   
   import os
   file_size = os.path.getsize('random_dataset.parquet') / (1024**3) 
   print(f"File size: {file_size:.2f} GB")
   
   
   ```
   
   
   ```python
   # memory_test.py
   
   import pandas as pd
   import polars as pl
   import gc
   import psutil
   import pyarrow.parquet
   import ctypes
   
   data_path = 'random_dataset.parquet'
   
   # To manually trigger memory release
   malloc_trim = ctypes.CDLL("libc.so.6").malloc_trim
   
   for i in range(10):
       df = pd.read_parquet(data_path, engine="pyarrow")
       # Also tested with:
       # df = pyarrow.parquet.read_pandas(data_path).to_pandas()
       # df = pl.read_parquet(data_path, use_pyarrow=True)
       
       del df  # Explicitly delete DataFrame
       
       for _ in range(3):  # Force garbage collection multiple times
           gc.collect()

       memory_info = psutil.virtual_memory()

       print(f"\n\nIteration number: {i}")
       print(f"Total Memory: {memory_info.total / (1024 ** 3):.2f} GB")
       print(f"Memory at disposal: {memory_info.available / (1024 ** 3):.2f} GB")
       print(f"Memory Used: {memory_info.used / (1024 ** 3):.2f} GB")
       print(f"Percentage of memory used: {memory_info.percent}%")
   
   # Calling malloc_trim(0) is the only way we found to release the memory
   malloc_trim(0)
   ```
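
   To help narrow down where the memory is retained, the following is a minimal diagnostic sketch (not part of the measurements above): it compares the process RSS with what Arrow's default memory pool still reports as allocated after the DataFrame is deleted, using `pyarrow.default_memory_pool()`. It assumes the same `random_dataset.parquet` file.

   ```python
   # memory_pool_check.py
   # Sketch: after deleting the DataFrame, check whether Arrow's default memory
   # pool still reports allocated bytes, or whether the memory has been freed to
   # the allocator but simply not returned to the OS (what malloc_trim forces).

   import gc

   import pandas as pd
   import psutil
   import pyarrow as pa

   data_path = 'random_dataset.parquet'
   pool = pa.default_memory_pool()
   process = psutil.Process()

   print(f"Arrow memory pool backend: {pool.backend_name}")

   for i in range(5):
       df = pd.read_parquet(data_path, engine="pyarrow")
       del df
       gc.collect()
       rss_gb = process.memory_info().rss / (1024 ** 3)
       arrow_gb = pool.bytes_allocated() / (1024 ** 3)
       print(f"Iteration {i}: RSS = {rss_gb:.2f} GB, Arrow pool = {arrow_gb:.2f} GB")
   ```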
   
   
![image](https://github.com/user-attachments/assets/f05bf547-4e4c-41cb-9f49-8f6e164d4cbd)
   
   
   **Observations:**
   
   - **Garbage Collection:** Despite invoking the garbage collector multiple 
times, memory allocated to the Python process keeps increasing when 
_.to_pandas()_ is used, indicating improper memory release during the 
conversion.
   - **Direct Use of PyArrow:** When we import the data directly using PyArrow 
(without converting to Pandas), the memory usage remains stable, showing that 
the problem originates in the Arrow-to-Pandas conversion process.
   - **Manual Memory Release (ctypes):** The only reliable way we have found to release the memory is by manually calling _malloc_trim(0)_ via ctypes. However, we believe this is not a proper solution and that memory management should be handled internally by Pandas (see the allocator check sketched after this list).
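
   A further check we have not run, sketched below, is whether the behavior changes when Arrow's default allocator is replaced with the system allocator via the documented `ARROW_DEFAULT_MEMORY_POOL` environment variable, since allocators differ in how eagerly they return freed pages to the OS. This is only a diagnostic sketch, not a fix we have verified.

   ```python
   # allocator_check.py
   # Sketch: test whether the retained memory depends on Arrow's default
   # allocator by selecting the system allocator instead.

   import os
   os.environ["ARROW_DEFAULT_MEMORY_POOL"] = "system"  # must be set before pyarrow is imported

   import gc

   import pandas as pd
   import psutil
   import pyarrow as pa

   print(f"Arrow memory pool backend: {pa.default_memory_pool().backend_name}")

   process = psutil.Process()
   for i in range(5):
       df = pd.read_parquet('random_dataset.parquet', engine="pyarrow")
       del df
       gc.collect()
       print(f"Iteration {i}: RSS = {process.memory_info().rss / (1024 ** 3):.2f} GB")
   ```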
   
   **OS environment**
   
   ```
   Icon name: computer-vm
   Chassis: vm
   Virtualization: microsoft
   Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)
   CPE OS Name: cpe:/o:redhat:enterprise_linux:8::baseos
   Kernel: Linux 4.18.0-553.16.1.el8_10.x86_64
   Architecture: x86-64
   ```
   
   **Conclusion**
   
   The issue seems to occur during the conversion from Arrow to Pandas rather than in PyArrow's Parquet reading itself. Given that memory is only released by manually invoking _malloc_trim(0)_, we suspect a problem with how memory is managed when PyArrow converts the data to Pandas. The issue does not arise when using the Fastparquet engine, or when using Polars instead of Pandas, further indicating that it is specific to the Pandas-Arrow interaction.
   
   We recommend investigating how memory is allocated and released during the 
conversion from Arrow objects to Pandas DataFrames to resolve this issue.
   
   Please let us know if further details are needed; we are happy to assist.
   
   **Contributors:**
   
   - @Voltagabbana 
   - @okamiRvS 
   - @carlonicolini 
   - @gfransvea
   
   We would appreciate any feedback or insights from the maintainers and other 
contributors on how to improve memory management in this context.
   
   
   ### Installed Versions
   
   INSTALLED VERSIONS
   ------------------
   commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
   python                : 3.10.14.final.0
   python-bits           : 64
   OS                    : Linux
   OS-release            : 4.18.0-553.16.1.el8_10.x86_64
   Version               : #1 SMP Thu Aug 1 04:16:12 EDT 2024
   machine               : x86_64
   processor             : x86_64
   byteorder             : little
   LC_ALL                : None
   LANG                  : en_US.UTF-8
   LOCALE                : en_US.UTF-8
   
   pandas                : 2.2.2
   numpy                 : 2.0.0
   pytz                  : 2024.1
   dateutil              : 2.9.0.post0
   setuptools            : 69.5.1
   pip                   : 24.0
   Cython                : None
   pytest                : None
   hypothesis            : None
   sphinx                : None
   blosc                 : None
   feather               : None
   xlsxwriter            : None
   lxml.etree            : None
   html5lib              : None
   pymysql               : None
   psycopg2              : None
   jinja2                : 3.1.4
   IPython               : 8.26.0
   pandas_datareader     : None
   adbc-driver-postgresql: None
   adbc-driver-sqlite    : None
   bs4                   : 4.12.3
   bottleneck            : None
   dataframe-api-compat  : None
   fastparquet           : 2024.5.0
   fsspec                : 2024.6.1
   gcsfs                 : None
   matplotlib            : 3.9.0
   numba                 : None
   numexpr               : None
   odfpy                 : None
   openpyxl              : None
   pandas_gbq            : None
   pyarrow               : 17.0.0
   pyreadstat            : None
   python-calamine       : None
   pyxlsb                : None
   s3fs                  : None
   scipy                 : None
   sqlalchemy            : None
   tables                : None
   tabulate              : None
   xarray                : None
   xlrd                  : None
   zstandard             : None
   tzdata                : 2024.1
   qtpy                  : None
   pyqt5                 : None
   
   ### Component(s)
   
   Parquet, Python

