Seminko opened a new issue, #47133:
URL: https://github.com/apache/arrow/issues/47133

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I’m using snowflake-connector to import data into Snowflake.
   Snowflake-connector converts a pandas DataFrame to Parquet using pyarrow and 
imports that Parquet file into Snowflake.
   I routinely import 1M-row dataframes with no issues in my async script. 
But roughly one batch in 50 runs into memory issues: pyarrow throws "malloc 
of size n failed".
    
   I was able to identify a subset of the data that gives me trouble, and even 
the specific column responsible.
   The column in question holds python dicts (parsed URLs).
   There are no obvious bad types, non-string keys or missing expected keys. No 
outliers in length or structure, yet it still fails.
   When I do `table = pa.Table.from_pandas(df)`, memory usage starts to balloon 
until it reaches almost 100% and then the call fails. If it doesn’t fail, it 
still consumes at least 22 GB of RAM, which can’t be right.
    
   I tried identifying offending rows by processing the df in batches with the 
idea of narrowing it down by reducing batch sizes. The strange thing is that 
when I split the df into batches of 10k rows, all of them go through without an 
issue, but when importing the whole dataframe (80k rows) it fails every time. 
Hence I don’t believe there’s something wrong with specific rows that would 
prevent the conversion to parquet.
   Again, let me reiterate that I routinely import batches of 1M rows where the 
dict column format is identical.
   
   I tested this with pyarrow v18.0 as well as the latest stable v20.0.
   
   [pyarrow_memory_leak_bug_report_df.pkl can be downloaded 
here](https://drive.google.com/file/d/1LZp3wQoVwQj874Jx7n0a1wMfdcy04hqP/view)
   ```
   import pandas as pd
   import pyarrow as pa
   
   df = pd.read_pickle("pyarrow_memory_leak_bug_report_df.pkl")
   
   # This consumes all available RAM
   table = pa.Table.from_pandas(df)
   ```
   
   ### Component(s)
   
   Python


