snakingfire opened a new issue, #44643:
URL: https://github.com/apache/arrow/issues/44643

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Related to https://github.com/apache/arrow/issues/44640
   
   When converting a pandas DataFrame that has a dict-typed column to a PyArrow table with a map column, the conversion fails once the DataFrame and column are large enough, with:
   ```
   /.../arrow/cpp/src/arrow/array/builder_nested.cc:103:  Check failed: 
(item_builder_->length()) == (key_builder_->length()) keys and items builders 
don't have the same size in MapBuilder
   ```
   The check failure is immediately followed by SIGABRT, crashing the process.
   
   For smaller DataFrames the conversion succeeds without error. In the reproduction code below, setting `dataframe_size` to a small value (e.g. 1M rows) produces no error, but beyond a certain size (e.g. 10M rows) the crash occurs.
   
   ```python
   import pandas as pd
   import pyarrow
   
   # Example DataFrame creation
   import numpy as np
   import random
   import string
   
   dataframe_size = 10_000_000
   
   map_keys = [
       "a1B2c3D4e5",
       "f6G7h8I9j0",
       "k1L2m3N4o5",
       "p6Q7r8S9t0",
       "u1V2w3X4y5",
       "z6A7b8C9d0",
       "e1F2g3H4i5",
       "j6K7l8M9n0",
       "o1P2q3R4s5",
       "t6U7v8W9x0",
       "y1Z2a3B4c5",
       "d6E7f8G9h0",
       "i1J2k3L4m5",
       "n6O7p8Q9r0",
       "s1T2u3V4w5",
   ]
   
   # Pre-generate random strings for columns to avoid repeated computation
   print("Generating random column strings")
   random_strings = [
       "".join(random.choices(string.ascii_letters + string.digits, k=20))
       for _ in range(int(dataframe_size / 100))
   ]
   
   # Pre-generate random map values
   print("Generating random map value strings")
   random_map_values = [
       "".join(
           random.choices(
               string.ascii_letters + string.digits, k=random.randint(20, 200)
           )
       )
       for _ in range(int(dataframe_size / 100))
   ]
   
   print("Generating random maps")
   random_maps = [
       {
           key: random.choice(random_map_values)
           for key in random.sample(map_keys, random.randint(5, 10))
       }
       for _ in range(int(dataframe_size / 100))
   ]
   
   print("Generating random dataframe")
   data_with_map_col = {
       "partition": np.full(dataframe_size, "1"),
       "column1": np.random.choice(random_strings, dataframe_size),
       "map_col": np.random.choice(random_maps, dataframe_size),
   }
   
   # Create DataFrame
   df_with_map_col = pd.DataFrame(data_with_map_col)
   
   column_types = {
       "partition": pyarrow.string(),
       "column1": pyarrow.string(),
       "map_col": pyarrow.map_(pyarrow.string(), pyarrow.string()),
   }
   schema = pyarrow.schema(fields=column_types)
   
   # Process crashes when dataframe is large enough
   table = pyarrow.Table.from_pandas(
       df=df_with_map_col, schema=schema, preserve_index=False, safe=True
   )
   ```
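   A possible workaround sketch, not verified at the failing 10M-row scale: build the map column directly with `pyarrow.array` from lists of key/value tuples, bypassing the `Table.from_pandas` path for the dict column. The variable names below are illustrative, and only a small sample is shown.

   ```python
   import pyarrow as pa

   # Sample dict column values; in the real case these would come from
   # df_with_map_col["map_col"].
   maps = [{"a": "1", "b": "2"}, {"c": "3"}]

   map_type = pa.map_(pa.string(), pa.string())

   # Convert each dict to a list of (key, value) tuples, which pyarrow.array
   # accepts for map types, and build the Arrow column directly.
   map_arr = pa.array([list(d.items()) for d in maps], type=map_type)

   # Assemble the table column-by-column instead of via Table.from_pandas.
   table = pa.table({"map_col": map_arr})
   ```

   Whether constructing the map array this way avoids the MapBuilder check failure at large sizes is untested; it is offered only as a way to isolate the dict-to-map conversion from the rest of the pandas conversion.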
   
   Environment Details:
   - Python version: 3.11.8
   - PyArrow version: 18.0.0
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
