lukaswenzl-akur8 opened a new issue, #44048:
URL: https://github.com/apache/arrow/issues/44048
### Describe the bug, including details regarding any error messages, version, and platform.
Converting a pandas DataFrame to pyarrow with `Table.from_pandas` fails when a
categorical column has a very large dictionary. Similarly, loading such a
column from a Parquet file and converting it to pandas with `Table.to_pandas()`
fails.
The failure occurs once the total number of characters across the categories
exceeds the maximum value of a signed 32-bit integer
(`np.sum(df["float_gran"].cat.categories.str.len()) > 2_147_483_647`), which
suggests an int32 overflow.
The following code reproduces the failure:
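As a rough check of that threshold, the sketch below totals the UTF-8 bytes of a set of category strings and compares the result against the int32 maximum. The helper names are hypothetical, and the 18-character average used for the estimate is an assumption about how random floats render as strings, not a measured value. For ASCII-only strings such as these, the byte count equals the character count.

```python
INT32_MAX = 2**31 - 1  # 2_147_483_647, the largest signed 32-bit value


def utf8_payload_bytes(values):
    """Total number of UTF-8 bytes across all strings (hypothetical helper)."""
    return sum(len(v.encode("utf-8")) for v in values)


def exceeds_int32_offsets(values):
    """True if the combined string data is larger than an int32 can index."""
    return utf8_payload_bytes(values) > INT32_MAX


# Rough estimate for this report: 120 million distinct floats rendered as
# strings of about 18 characters each (an assumed average, not measured).
estimated_bytes = 120_000_000 * 18
print(estimated_bytes, estimated_bytes > INT32_MAX)  # 2160000000 True
```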
```python
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pandas as pd
>>> pd.__version__
'2.2.2'
>>> import numpy as np
>>> pa.__version__
'17.0.0'
>>> n_rows = 120_000_000
>>> df = pd.DataFrame()
>>> df["float_gran"] = np.random.rand(n_rows)
>>> df["float_gran"] = df["float_gran"].astype(str).astype("category")
>>> pa.Table.from_pandas(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 pa.Table.from_pandas(df)

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:4623, in pyarrow.lib.Table.from_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:616, in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    611     return (isinstance(arr, np.ndarray) and
    612             arr.flags.contiguous and
    613             issubclass(arr.dtype.type, np.integer))
    615 if nthreads == 1:
--> 616     arrays = [convert_column(c, f)
    617               for c, f in zip(columns_to_convert, convert_fields)]
    618 else:
    619     arrays = []

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:616, in <listcomp>(.0)
    611     return (isinstance(arr, np.ndarray) and
    612             arr.flags.contiguous and
    613             issubclass(arr.dtype.type, np.integer))
    615 if nthreads == 1:
--> 616     arrays = [convert_column(c, f)
    617               for c, f in zip(columns_to_convert, convert_fields)]
    618 else:
    619     arrays = []

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:597, in dataframe_to_arrays.<locals>.convert_column(col, field)
    594     type_ = field.type
    596 try:
--> 597     result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    598 except (pa.ArrowInvalid,
    599         pa.ArrowNotImplementedError,
    600         pa.ArrowTypeError) as e:
    601     e.args += ("Conversion failed for column {!s} with type {!s}"
    602                .format(col.name, col.dtype),)

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:346, in pyarrow.lib.array()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:3863, in pyarrow.lib.DictionaryArray.from_arrays()

TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
```
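Note: the same error message was reported in [issue
#41936](https://github.com/apache/arrow/issues/41936), but that discussion
concerned RecordBatch, and it was stated there that `Table.from_pandas`, which
is used here, should work fine.
By contrast, converting with `pyarrow.interchange.from_dataframe` succeeds, but
`to_pandas()` then fails after a round trip through Parquet: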
```python
>>> from pyarrow.interchange import from_dataframe
>>> table = from_dataframe(df)
>>> table
pyarrow.Table
float_gran: dictionary<values=large_string, indices=int32, ordered=0>
----
float_gran: [ -- dictionary:
["0.00010000625394479545","0.00010000637024687453","0.00010002156605048995","0.00010002375983830802","0.00010003348618559116",...,"9.996332028860966e-05","9.996352231378403e-05","9.996744926299428e-06","9.99769273748452e-05","9.99829165827526e-05"]
-- indices:
[72271820,16482433,4156153,49996213,77690435,...,24623248,57247299,27016212,102115156,112204811]]
>>> table.to_pandas()
>>> # works!
>>> pq.write_table(table, "test", use_dictionary=False)
>>> table_loaded = pq.read_table("test")
>>> table_loaded.to_pandas()
---------------------------------------------------------------------------
ArrowCapacityError                        Traceback (most recent call last)
Cell In[18], line 1
----> 1 table_loaded.to_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:885, in pyarrow.lib._PandasConvertible.to_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:5002, in pyarrow.lib.Table._to_pandas()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:784, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper)
    781 columns = _deserialize_column_index(table, all_columns, column_indexes)
    783 column_names = table.column_names
--> 784 result = pa.lib.table_to_blocks(options, table, categories,
    785                                 list(ext_columns_dtypes.keys()))
    786 if _pandas_api.is_ge_v3():
    787     from pandas.api.internals import create_dataframe_from_blocks

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:3941, in pyarrow.lib.table_to_blocks()

File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483657
```
Tested on macOS Sonoma 14.5; the errors also occurred on Linux servers.
It seems `from_dataframe` avoids the error by using the `large_string`
data type. However, we find that `from_dataframe` performs significantly
worse than `from_pandas` in most cases and would therefore like to avoid using
it. Additionally, the `large_string` data type seems to be lost on reload.
Is there already a way to reliably avoid the TypeError and
ArrowCapacityError with the optimized pandas conversion methods, and is this a
bug that could be fixed in a future version?
### Component(s)
Python