rohan-shah-nearmap opened a new issue, #45385:
URL: https://github.com/apache/arrow/issues/45385
### Describe the bug, including details regarding any error messages, version, and platform.

With pyarrow `18.1.0`, I have the following situation: I have a pandas `DataFrame` which has been constructed as a merge of two original tables. This seems to give the constructed table some kind of fragmented memory structure:

```
arrays, schema, n_rows = pa.pandas_compat.dataframe_to_arrays(merged_df, schema=schema, preserve_index=False)
>>> len(arrays)
1
>>> arrays[0].num_chunks
67988
```

Constructing a table from these arrays *and specifying a schema* results in a huge memory blowout (>80 GB), even though roughly 100 MB would be expected for all the data:

```
# Memory blowout
table = pa.Table.from_arrays([arrays[0]], schema=schema)
```

Not specifying a schema, there is no such blowout:

```
# No memory blowout
table = pa.Table.from_arrays([arrays[0]], names=....)
```

And using `combine_chunks` seems to fix the problem:

```
table = pa.Table.from_arrays([arrays[0].combine_chunks()], schema=schema)
```

I have checked the `pyarrow` codebase, and `_sanitize_arrays` takes different code paths depending on whether a schema is specified. Hypothesis: the code path in `_sanitize_arrays` that runs when a schema is specified does not handle highly fragmented inputs well.

Apologies for not giving a more specific reproducible example; since the problem seems to depend on memory layout, it is difficult to reduce my case to something small. I'm hoping that someone can take the above and work out what is happening (a rough synthetic sketch of the kind of setup involved is included below).

### Component(s)

Python
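A minimal, self-contained sketch of the kind of comparison that might help narrow this down. It is not verified to reproduce the blowout: the chunk count, the `value` column name, and the use of `pa.total_allocated_bytes()` as a rough memory probe are placeholders, not details taken from the report above.

```python
import pyarrow as pa

# Simulate the fragmented layout described above: a single column stored as a
# ChunkedArray with tens of thousands of tiny chunks.
chunks = [pa.array([i], type=pa.int64()) for i in range(50_000)]
chunked = pa.chunked_array(chunks)
schema = pa.schema([("value", pa.int64())])

def allocation_delta(build):
    # Rough probe: change in bytes held by Arrow's default memory pool across
    # the table construction. The resulting table is still referenced at the
    # moment of measurement, so its buffers are included in the delta.
    before = pa.total_allocated_bytes()
    table = build()
    return table, pa.total_allocated_bytes() - before

# Path 1: schema specified (the reported blowout case).
t1, delta_schema = allocation_delta(
    lambda: pa.Table.from_arrays([chunked], schema=schema))

# Path 2: names specified, no schema (reported as fine).
t2, delta_names = allocation_delta(
    lambda: pa.Table.from_arrays([chunked], names=["value"]))

# Path 3: chunks combined first, then schema specified (the reported workaround).
t3, delta_combined = allocation_delta(
    lambda: pa.Table.from_arrays([chunked.combine_chunks()], schema=schema))

print(f"schema path:   {delta_schema:>12,d} bytes")
print(f"names path:    {delta_names:>12,d} bytes")
print(f"combined path: {delta_combined:>12,d} bytes")
```

If the schema code path really does materialize extra data per chunk, a synthetic column like this should show the first delta growing much faster than the other two as the chunk count increases; if the deltas stay comparable, the blowout presumably also depends on the specific dtypes or layout produced by the merged DataFrame.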