snakingfire opened a new issue, #44643: URL: https://github.com/apache/arrow/issues/44643
### Describe the bug, including details regarding any error messages, version, and platform.

Related to https://github.com/apache/arrow/issues/44640

When attempting to convert a pandas DataFrame that has a dict-typed column to a PyArrow table with a map column, if the DataFrame and column are of sufficient size, the conversion fails with:

```
/.../arrow/cpp/src/arrow/array/builder_nested.cc:103: Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder
```

This is immediately followed by SIGABRT and the process crashing. When the DataFrame is smaller, the conversion succeeds without error. In the reproduction code below, a small `dataframe_size` (e.g. 1M rows) produces no error, but at a certain size (e.g. 10M rows) the error condition occurs.

```python
import random
import string

import numpy as np
import pandas as pd
import pyarrow

# Example DataFrame creation
dataframe_size = 10_000_000

map_keys = [
    "a1B2c3D4e5", "f6G7h8I9j0", "k1L2m3N4o5", "p6Q7r8S9t0", "u1V2w3X4y5",
    "z6A7b8C9d0", "e1F2g3H4i5", "j6K7l8M9n0", "o1P2q3R4s5", "t6U7v8W9x0",
    "y1Z2a3B4c5", "d6E7f8G9h0", "i1J2k3L4m5", "n6O7p8Q9r0", "s1T2u3V4w5",
]

# Pre-generate random strings for columns to avoid repeated computation
print("Generating random column strings")
random_strings = [
    "".join(random.choices(string.ascii_letters + string.digits, k=20))
    for _ in range(int(dataframe_size / 100))
]

# Pre-generate random map values
print("Generating random map value strings")
random_map_values = [
    "".join(
        random.choices(
            string.ascii_letters + string.digits, k=random.randint(20, 200)
        )
    )
    for _ in range(int(dataframe_size / 100))
]

print("Generating random maps")
random_maps = [
    {
        key: random.choice(random_map_values)
        for key in random.sample(map_keys, random.randint(5, 10))
    }
    for _ in range(int(dataframe_size / 100))
]

print("Generating random dataframe")
data_with_map_col = {
    "partition": np.full(dataframe_size, "1"),
    "column1": np.random.choice(random_strings, dataframe_size),
    "map_col": np.random.choice(random_maps, dataframe_size),
}

# Create DataFrame
df_with_map_col = pd.DataFrame(data_with_map_col)

column_types = {
    "partition": pyarrow.string(),
    "column1": pyarrow.string(),
    "map_col": pyarrow.map_(pyarrow.string(), pyarrow.string()),
}
schema = pyarrow.schema(fields=column_types)

# Process crashes when the dataframe is large enough
table = pyarrow.Table.from_pandas(
    df=df_with_map_col, schema=schema, preserve_index=False, safe=True
)
```

Environment details:
- Python version: 3.11.8
- PyArrow version: 18.0.0

### Component(s)

Python