lukaswenzl-akur8 opened a new issue, #44048: URL: https://github.com/apache/arrow/issues/44048
### Describe the bug, including details regarding any error messages, version, and platform.

Converting from pandas to pyarrow with `Table.from_pandas` fails for dataframes that have categorical columns with large dictionaries. Similarly, loading such a column from a Parquet file and converting it to pandas with `Table.to_pandas()` fails. The failure happens once the total number of characters across the categories exceeds the maximum of a signed 32-bit integer (`np.sum(df["float_gran"].cat.categories.str.len()) > 2_147_483_647`), which suggests an int32 overflow. Below is example code that reproduces the failure:

```python
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pandas as pd
>>> pd.__version__
'2.2.2'
>>> import numpy as np
>>> pa.__version__
'17.0.0'
>>> n_rows = 120_000_000
>>> df = pd.DataFrame()
>>> df["float_gran"] = np.random.rand(n_rows)
>>> df["float_gran"] = df["float_gran"].astype(str).astype("category")
>>> pa.Table.from_pandas(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 pa.Table.from_pandas(df)

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:4623, in pyarrow.lib.Table.from_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:616, in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    611 return (isinstance(arr, np.ndarray) and
    612         arr.flags.contiguous and
    613         issubclass(arr.dtype.type, np.integer))
    615 if nthreads == 1:
--> 616     arrays = [convert_column(c, f)
    617               for c, f in zip(columns_to_convert, convert_fields)]
    618 else:
    619     arrays = []

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:616, in <listcomp>(.0)
    611 return (isinstance(arr, np.ndarray) and
    612         arr.flags.contiguous and
    613         issubclass(arr.dtype.type, np.integer))
    615 if nthreads == 1:
--> 616     arrays = [convert_column(c, f)
    617               for c, f in zip(columns_to_convert, convert_fields)]
    618 else:
    619     arrays = []

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:597, in dataframe_to_arrays.<locals>.convert_column(col, field)
    594 type_ = field.type
    596 try:
--> 597     result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    598 except (pa.ArrowInvalid,
    599         pa.ArrowNotImplementedError,
    600         pa.ArrowTypeError) as e:
    601     e.args += ("Conversion failed for column {!s} with type {!s}"
    602                .format(col.name, col.dtype),)

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:346, in pyarrow.lib.array()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:3863, in pyarrow.lib.DictionaryArray.from_arrays()

TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
```

Note: the same error message was reported in [issue #41936](https://github.com/apache/arrow/issues/41936), but that discussion was about `RecordBatch`, and it was noted there that `Table.from_pandas`, which is used here, should work fine.
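For context, Arrow's `string` type stores values with 32-bit offsets, so a single array can hold at most about 2 GiB of character data, while `large_string` uses 64-bit offsets. One possible workaround, which we have not verified at the failing scale, would be to pass an explicit schema to `Table.from_pandas` that requests `large_string` dictionary values. A minimal sketch (the small `n_rows` here is only so the snippet runs standalone):

```python
import numpy as np
import pandas as pd
import pyarrow as pa

# Same construction as in the reproduction above, just smaller.
n_rows = 1_000
df = pd.DataFrame({"float_gran": np.random.rand(n_rows)})
df["float_gran"] = df["float_gran"].astype(str).astype("category")

# Sketch only: explicitly request large_string dictionary values so the
# concatenated category bytes are not limited by 32-bit string offsets.
# Whether this is honored for Categorical columns at the failing scale is untested.
schema = pa.schema([
    pa.field("float_gran", pa.dictionary(pa.int32(), pa.large_string())),
])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
print(table.schema)
```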
Using `pyarrow.interchange.from_dataframe` instead does convert successfully, but the problem comes back after a Parquet round trip:

```python
>>> from pyarrow.interchange import from_dataframe
>>> table = from_dataframe(df)
>>> table
pyarrow.Table
float_gran: dictionary<values=large_string, indices=int32, ordered=0>
----
float_gran: [
  -- dictionary:
    ["0.00010000625394479545","0.00010000637024687453","0.00010002156605048995","0.00010002375983830802","0.00010003348618559116",...,"9.996332028860966e-05","9.996352231378403e-05","9.996744926299428e-06","9.99769273748452e-05","9.99829165827526e-05"]
  -- indices:
    [72271820,16482433,4156153,49996213,77690435,...,24623248,57247299,27016212,102115156,112204811]]
>>> table.to_pandas()
>>> # works!
>>> pq.write_table(table, "test", use_dictionary=False)
>>> table_loaded = pq.read_table("test")
>>> table_loaded.to_pandas()
---------------------------------------------------------------------------
ArrowCapacityError                        Traceback (most recent call last)
Cell In[18], line 1
----> 1 table_loaded.to_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:885, in pyarrow.lib._PandasConvertible.to_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:5002, in pyarrow.lib.Table._to_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:784, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper)
    781 columns = _deserialize_column_index(table, all_columns, column_indexes)
    783 column_names = table.column_names
--> 784 result = pa.lib.table_to_blocks(options, table, categories,
    785                                 list(ext_columns_dtypes.keys()))
    786 if _pandas_api.is_ge_v3():
    787     from pandas.api.internals import create_dataframe_from_blocks

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:3941, in pyarrow.lib.table_to_blocks()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483657
```

Tested on macOS Sonoma 14.5; the errors also occurred on Linux servers.

It seems `from_dataframe` avoids the error by using the `large_string` datatype. However, we find `from_dataframe` to perform significantly worse than `from_pandas` in most cases and would therefore like to avoid using it. Additionally, the `large_string` datatype seems to be lost on reload. Is there already a way to reliably avoid the `TypeError` and `ArrowCapacityError` in the optimized pandas conversion methods, and is this a bug that could be fixed in future versions?

### Component(s)

Python
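For the reload case above, a possible stopgap, which we have not verified at full scale, would be to cast the loaded column back to `large_string` before calling `to_pandas`, roughly along these lines:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# "test" is the Parquet file written in the reproduction above.
table_loaded = pq.read_table("test")

# Sketch only: cast the column to large_string so the concatenated values are
# not subject to the ~2 GiB limit of 32-bit string offsets. Untested whether
# this fully avoids the ArrowCapacityError at the failing size.
idx = table_loaded.schema.get_field_index("float_gran")
casted = pc.cast(table_loaded["float_gran"], pa.large_string())
table_fixed = table_loaded.set_column(idx, "float_gran", casted)
df_back = table_fixed.to_pandas()
```

Even if this works, it only sidesteps the symptom; the question about `from_pandas` / `to_pandas` handling large dictionaries directly still stands.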