rohan-shah-nearmap opened a new issue, #45385:
URL: https://github.com/apache/arrow/issues/45385
### Describe the bug, including details regarding any error messages, version, and platform.

With pyarrow `18.1.0`, I have the following situation: I have a pandas `DataFrame` which has been constructed as a merge of two original tables. This seems to give the constructed table some kind of fragmented memory structure:

```
arrays, schema, n_rows = pa.pandas_compat.dataframe_to_arrays(merged_df, schema=schema, preserve_index=False)
>>> len(arrays)
1
>>> arrays[0].num_chunks
67988
```

Constructing a table from these arrays *and specifying a schema* results in a huge memory blowout (>80 GB), even though roughly 100 MB would be expected for all the data:

```
# Memory blowout
table = pa.Table.from_arrays([arrays[0]], schema=schema)
```

Not specifying a schema, there is no such blowout:

```
# No memory blowout
table = pa.Table.from_arrays([arrays[0]], names=....)
```

And using `combine_chunks` seems to fix the problem:

```
table = pa.Table.from_arrays([arrays[0].combine_chunks()], schema=schema)
```

I have checked the `pyarrow` codebase, and `_sanitize_arrays` takes different code paths depending on whether a schema is specified. Hypothesis: the code path in `_sanitize_arrays` that runs when a schema is specified does not handle highly fragmented inputs well.

Apologies for not giving a more specific reproducible example; since the problem seems to depend on memory layout, it is difficult to reduce my case to something small. I'm hoping that someone can take the above and work out what is happening (a rough synthetic sketch of the kind of setup involved is included below).

### Component(s)

Python
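A minimal, self-contained sketch of the kind of comparison that might help narrow this down. It is not verified to reproduce the blowout: the chunk count, the `value` column name, and the use of `pa.total_allocated_bytes()` as a rough memory probe are placeholders, not details taken from the report above.

```python
import pyarrow as pa

# Simulate the fragmented layout described above: a single column stored as a
# ChunkedArray with tens of thousands of tiny chunks.
chunks = [pa.array([i], type=pa.int64()) for i in range(50_000)]
chunked = pa.chunked_array(chunks)
schema = pa.schema([("value", pa.int64())])

def allocation_delta(build):
    # Rough probe: change in bytes held by Arrow's default memory pool across
    # the table construction. The resulting table is still referenced at the
    # moment of measurement, so its buffers are included in the delta.
    before = pa.total_allocated_bytes()
    table = build()
    return table, pa.total_allocated_bytes() - before

# Path 1: schema specified (the reported blowout case).
t1, delta_schema = allocation_delta(
    lambda: pa.Table.from_arrays([chunked], schema=schema))

# Path 2: names specified, no schema (reported as fine).
t2, delta_names = allocation_delta(
    lambda: pa.Table.from_arrays([chunked], names=["value"]))

# Path 3: chunks combined first, then schema specified (the reported workaround).
t3, delta_combined = allocation_delta(
    lambda: pa.Table.from_arrays([chunked.combine_chunks()], schema=schema))

print(f"schema path:   {delta_schema:>12,d} bytes")
print(f"names path:    {delta_names:>12,d} bytes")
print(f"combined path: {delta_combined:>12,d} bytes")
```

If the schema code path really does materialize extra data per chunk, a synthetic column like this should show the first delta growing much faster than the other two as the chunk count increases; if the deltas stay comparable, the blowout presumably also depends on the specific dtypes or layout produced by the merged DataFrame.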