Seminko opened a new issue, #47133:
URL: https://github.com/apache/arrow/issues/47133
### Describe the bug, including details regarding any error messages, version, and platform.
I'm using snowflake-connector to import data into Snowflake. The connector converts a pandas DataFrame into Parquet using pyarrow and loads that Parquet file into Snowflake.
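For context, this is roughly the path the connector exercises under the hood, as I understand it; the helper name and file path below are just placeholders:

```
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def dataframe_to_parquet(df: pd.DataFrame, path: str) -> None:
    # Convert the pandas DataFrame to an Arrow table --
    # this is the step that blows up for the problematic batch.
    table = pa.Table.from_pandas(df)
    # Write the Arrow table out as a Parquet file for the Snowflake load.
    pq.write_table(table, path)
```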
I routinely import 1M-row DataFrames with no issues in my async script, but roughly one batch in 50 runs into memory issues: pyarrow throws `malloc of size n failed`.
I was able to identify a subset of the data that gives me trouble, and even the specific column. The column in question holds Python dicts (parsed URLs). There are no obvious bad types, non-string keys, or missing expected keys, and no outliers in length or structure, yet it still fails.
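Converting just that column in isolation is enough to rule out the rest of the DataFrame; a minimal sketch, assuming the dict column is named `parsed_url` (a placeholder for the real name):

```
import pandas as pd
import pyarrow as pa

df = pd.read_pickle("pyarrow_memory_leak_bug_report_df.pkl")

# pyarrow infers a struct type from the dict keys/values;
# converting only this column should be enough to trigger the blow-up.
arr = pa.array(df["parsed_url"])
print(arr.type)
```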
When I run `table = pa.Table.from_pandas(df)`, memory usage balloons until it reaches almost 100% and the call fails. When it doesn't fail outright, it consumes at least 22 GB of RAM, which can't be right.
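One way to observe the growth from Arrow's side (assuming the allocations land in the default memory pool) is to compare allocator statistics before and after the conversion:

```
import pandas as pd
import pyarrow as pa

df = pd.read_pickle("pyarrow_memory_leak_bug_report_df.pkl")

before = pa.total_allocated_bytes()
table = pa.Table.from_pandas(df)
after = pa.total_allocated_bytes()

# Report how much the default Arrow memory pool grew during conversion.
print(f"Arrow allocated {(after - before) / 1e9:.2f} GB")
```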
I tried to identify the offending rows by processing the DataFrame in batches, with the idea of narrowing it down by shrinking the batch size. The strange thing is that when I split the DataFrame into batches of 10k rows, all of them convert without issue, but converting the whole DataFrame (80k rows) fails every time. So I don't believe specific rows are to blame for preventing the conversion to Parquet.
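Here's a minimal sketch of that batching experiment:

```
import pandas as pd
import pyarrow as pa

df = pd.read_pickle("pyarrow_memory_leak_bug_report_df.pkl")

batch_size = 10_000
for start in range(0, len(df), batch_size):
    chunk = df.iloc[start:start + batch_size]
    # Each 10k-row slice converts fine on its own...
    table = pa.Table.from_pandas(chunk)

# ...but converting the full 80k-row DataFrame fails every time:
# table = pa.Table.from_pandas(df)
```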
Let me reiterate: I routinely import 1M-row batches where the dict column format is identical.
I tested this with pyarrow v18.0 as well as the latest stable v20.0.
[pyarrow_memory_leak_bug_report_df.pkl can be downloaded
here](https://drive.google.com/file/d/1LZp3wQoVwQj874Jx7n0a1wMfdcy04hqP/view)
```
import pandas as pd
import pyarrow as pa

df = pd.read_pickle("pyarrow_memory_leak_bug_report_df.pkl")

# This gobbles up all available RAM
table = pa.Table.from_pandas(df)
```
### Component(s)
Python