Voltagabbana opened a new issue, #44472: URL: https://github.com/apache/arrow/issues/44472
### Describe the bug, including details regarding any error messages, version, and platform.

**Description**

We've identified a memory leak when importing Parquet files into Pandas DataFrames using the PyArrow engine. The issue occurs specifically during the conversion from Arrow to Pandas objects: memory is not released even after deleting the DataFrame and invoking garbage collection.

**Key findings:**

- **No leak with PyArrow alone:** When using PyArrow to read Parquet without converting to Pandas (i.e., no `.to_pandas()`), the memory leak does not occur.
- **Leak with `.to_pandas()`:** The memory leak appears during the conversion from Arrow to Pandas, suggesting the problem is tied to this process.
- **No issue with Fastparquet or Polars:** Fastparquet and Polars (even with PyArrow) do not exhibit this memory issue, reinforcing that the problem lies in Pandas' handling of Arrow data.

**Reproduction Code**

```python
# dataset_creation.py
# Build a synthetic dataset and write it to Parquet.
import os
import random
import string

import numpy as np
import pandas as pd

np.random.seed(42)
random.seed(42)

def random_string(length):
    letters = string.ascii_letters
    return ''.join(random.choice(letters) for _ in range(length))

num_rows = 10**6
col_types = {
    'col1': lambda: random_string(10),
    'col2': lambda: np.random.randint(0, 1000),
    'col3': lambda: np.random.random(),
    'col4': lambda: random_string(5),
    'col5': lambda: np.random.randint(1000, 10000),
    'col6': lambda: np.random.uniform(0, 100),
    'col7': lambda: random_string(8),
    'col8': lambda: np.random.random() * 1000,
    'col9': lambda: np.random.randint(0, 2),
    'col10': lambda: random_string(1000),
}

data = {col: [func() for _ in range(num_rows)] for col, func in col_types.items()}
df = pd.DataFrame(data)
df.to_parquet('random_dataset.parquet', index=True)

file_size = os.path.getsize('random_dataset.parquet') / (1024**3)
print(f"File size: {file_size:.2f} GB")
```

```python
# memory_test.py
import ctypes
import gc

import pandas as pd
import polars as pl
import psutil
import pyarrow.parquet

data_path = 'random_dataset.parquet'

# To manually trigger memory release (glibc only)
malloc_trim = ctypes.CDLL("libc.so.6").malloc_trim

for i in range(10):
    df = pd.read_parquet(data_path, engine="pyarrow")
    # Also tested with:
    # df = pyarrow.parquet.read_pandas(data_path).to_pandas()
    # df = pl.read_parquet(data_path, use_pyarrow=True)

    del df  # Explicitly delete the DataFrame
    for _ in range(3):  # Force garbage collection multiple times
        gc.collect()

    memory_info = psutil.virtual_memory()
    print(f"\n\nIteration number: {i}")
    print(f"Total Memory: {memory_info.total / (1024 ** 3):.2f} GB")
    print(f"Available Memory: {memory_info.available / (1024 ** 3):.2f} GB")
    print(f"Memory Used: {memory_info.used / (1024 ** 3):.2f} GB")
    print(f"Percentage of memory used: {memory_info.percent}%")

# Calling malloc_trim(0) is the only way we found to release the memory
malloc_trim(0)
```

**Observations:**

- **Garbage Collection:** Despite invoking the garbage collector multiple times, memory allocated to the Python process keeps increasing when `.to_pandas()` is used, indicating improper memory release during the conversion.
- **Direct Use of PyArrow:** When we import the data directly using PyArrow (without converting to Pandas), memory usage remains stable, showing that the problem originates in the Arrow-to-Pandas conversion.
- **Manual Memory Release (ctypes):** The only reliable way we have found to release the memory is by manually calling `malloc_trim(0)` via ctypes (see the diagnostic sketch below).
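To separate memory still held by Arrow's pool from memory retained by the process allocator, a minimal diagnostic along these lines can help (a sketch, assuming `psutil` is installed; `bytes_allocated()` and `backend_name` are part of PyArrow's `MemoryPool` API):

```python
# pool_vs_rss.py
# Diagnostic sketch: compare Arrow's memory-pool accounting with the
# process RSS. If the pool reports ~0 bytes after `del df` while RSS
# keeps growing, the pages are retained by the allocator, not by Arrow.
import gc

import pandas as pd
import psutil
import pyarrow as pa

proc = psutil.Process()

for i in range(5):
    df = pd.read_parquet('random_dataset.parquet', engine="pyarrow")
    del df
    gc.collect()

    pool = pa.default_memory_pool()
    print(f"Iteration {i}: pool backend = {pool.backend_name}, "
          f"pool bytes = {pool.bytes_allocated() / 1024**2:.1f} MiB, "
          f"RSS = {proc.memory_info().rss / 1024**3:.2f} GiB")
```

In our tests, only the explicit `malloc_trim(0)` call actually returns the memory to the OS.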
However, we believe this is not a proper solution and that memory management should be handled internally by Pandas.

**OS environment**

```
Icon name:        computer-vm
Chassis:          vm
Virtualization:   microsoft
Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)
CPE OS Name:      cpe:/o:redhat:enterprise_linux:8::baseos
Kernel:           Linux 4.18.0-553.16.1.el8_10.x86_64
Architecture:     x86-64
```

**Conclusion**

The issue seems to occur during the conversion from Arrow to Pandas, rather than being a problem within PyArrow itself. Given that memory is only released by manually invoking `malloc_trim(0)`, we suspect a problem with how memory is managed when PyArrow converts the data to Pandas. The issue does not arise when using the Fastparquet engine or Polars instead of Pandas, further indicating that it is specific to the Pandas-Arrow interaction. We recommend investigating how memory is allocated and released during the conversion from Arrow objects to Pandas DataFrames. Please let us know if further details are needed; we are happy to assist.

**Contributors:**

- @Voltagabbana
- @okamiRvS
- @carlonicolini
- @gfransvea

We would appreciate any feedback or insights from the maintainers and other contributors on how to improve memory management in this context.

### Installed Versions

```
INSTALLED VERSIONS
------------------
commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.10.14.final.0
python-bits           : 64
OS                    : Linux
OS-release            : 4.18.0-553.16.1.el8_10.x86_64
Version               : #1 SMP Thu Aug 1 04:16:12 EDT 2024
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.2
numpy                 : 2.0.0
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 69.5.1
pip                   : 24.0
Cython                : None
pytest                : None
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.4
IPython               : 8.26.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : 2024.5.0
fsspec                : 2024.6.1
gcsfs                 : None
matplotlib            : 3.9.0
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 17.0.0
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None
```

### Component(s)

Parquet, Python
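**Addendum:** a sketch of a follow-up experiment that may help narrow the problem down: rerunning the read/delete loop under each of Arrow's allocator backends. This assumes the documented `ARROW_DEFAULT_MEMORY_POOL` environment variable and that the chosen backend is available in the installed PyArrow build; the variable must be set before `pyarrow` is first imported.

```python
# allocator_backend_test.py
# Sketch: repeat the read/delete loop under a chosen Arrow allocator
# backend to see whether the memory retention depends on the allocator.
import os

# Must be set before pyarrow is imported; try "system", "jemalloc", "mimalloc".
os.environ["ARROW_DEFAULT_MEMORY_POOL"] = "system"

import gc

import pandas as pd
import psutil
import pyarrow as pa

print(f"Arrow memory pool backend: {pa.default_memory_pool().backend_name}")

proc = psutil.Process()
for i in range(10):
    df = pd.read_parquet('random_dataset.parquet', engine="pyarrow")
    del df
    gc.collect()
    print(f"Iteration {i}: RSS = {proc.memory_info().rss / 1024**3:.2f} GiB")
```

If the growth disappears under one backend but not another, that would point at the allocator's page-retention behavior rather than at the Arrow-to-Pandas conversion itself.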