lukaswenzl-akur8 opened a new issue, #44048:
URL: https://github.com/apache/arrow/issues/44048

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Converting from pandas to pyarrow with Table.from_pandas fails for dataframes that have categorical columns with large dictionaries. Similarly, loading such a column from a Parquet file and converting it to pandas with Table.to_pandas() fails.
   
   The failure happens once the total number of characters across the categories exceeds the maximum of a signed 32-bit integer (`np.sum(df["float_gran"].cat.categories.str.len()) > 2_147_483_647`), indicating it is likely an int32 overflow issue (see the check after the traceback below).
   
   
   Below is example code that reproduces the failure:
   
   ```python
   >>> import pyarrow as pa
   >>> import pyarrow.parquet as pq
   >>> import pandas as pd
   >>> pd.__version__
   '2.2.2'
   >>> import numpy as np
   >>> pa.__version__
   '17.0.0'
   >>> n_rows = 120_000_000
   >>> df = pd.DataFrame()
   >>> df["float_gran"] = np.random.rand(n_rows)
   >>> df["float_gran"] = df["float_gran"].astype(str).astype("category")
   >>> pa.Table.from_pandas(df)
   ---------------------------------------------------------------------------
   TypeError                                 Traceback (most recent call last)
   Cell In[6], line 1
   ----> 1 pa.Table.from_pandas(df)
   
   File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:4623, in pyarrow.lib.Table.from_pandas()
   
   File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:616, in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
       611     return (isinstance(arr, np.ndarray) and
       612             arr.flags.contiguous and
       613             issubclass(arr.dtype.type, np.integer))
       615 if nthreads == 1:
   --> 616     arrays = [convert_column(c, f)
       617               for c, f in zip(columns_to_convert, convert_fields)]
       618 else:
       619     arrays = []
   
   File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:616, in <listcomp>(.0)
       611     return (isinstance(arr, np.ndarray) and
       612             arr.flags.contiguous and
       613             issubclass(arr.dtype.type, np.integer))
       615 if nthreads == 1:
   --> 616     arrays = [convert_column(c, f)
       617               for c, f in zip(columns_to_convert, convert_fields)]
       618 else:
       619     arrays = []
   
   File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:597, in dataframe_to_arrays.<locals>.convert_column(col, field)
       594     type_ = field.type
       596 try:
   --> 597     result = pa.array(col, type=type_, from_pandas=True, safe=safe)
       598 except (pa.ArrowInvalid,
       599         pa.ArrowNotImplementedError,
       600         pa.ArrowTypeError) as e:
       601     e.args += ("Conversion failed for column {!s} with type {!s}"
       602                .format(col.name, col.dtype),)
   
   File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:346, in pyarrow.lib.array()
   
   File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:3863, in pyarrow.lib.DictionaryArray.from_arrays()
   
   TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
   ```
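
   For reference, the threshold can be checked directly on the `df` from the reproduction above. The exact total depends on the random data, but with 120 million stringified floats it should land just above the signed 32-bit limit:
   
   ```python
   # Sanity check of the int32 hypothesis, using the df built above: the total
   # number of characters across all category values exceeds 2**31 - 1.
   int32_max = 2_147_483_647
   total_chars = df["float_gran"].cat.categories.str.len().sum()
   print(total_chars, total_chars > int32_max)
   ```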
   Note: the same error message was reported in [issue #41936](https://github.com/apache/arrow/issues/41936), but that discussion was about RecordBatch, and it was noted there that Table.from_pandas (used here) should work fine.
   
   ```python
   >>> from pyarrow.interchange import from_dataframe
   >>> table = from_dataframe(df)
   >>> table
   pyarrow.Table
   float_gran: dictionary<values=large_string, indices=int32, ordered=0>
   ----
   float_gran: [  -- dictionary:
   ["0.00010000625394479545","0.00010000637024687453","0.00010002156605048995","0.00010002375983830802","0.00010003348618559116",...,"9.996332028860966e-05","9.996352231378403e-05","9.996744926299428e-06","9.99769273748452e-05","9.99829165827526e-05"]
     -- indices:
   [72271820,16482433,4156153,49996213,77690435,...,24623248,57247299,27016212,102115156,112204811]]
   >>> table.to_pandas()
   >>> # works!
   >>> pq.write_table(table, "test", use_dictionary=False)
   >>> table_loaded = pq.read_table("test")
   >>> table_loaded.to_pandas()
   ---------------------------------------------------------------------------
   ArrowCapacityError                        Traceback (most recent call last)
   Cell In[18], line 1
   ----> 1 table_loaded.to_pandas()
   
   File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/array.pxi:885, in pyarrow.lib._PandasConvertible.to_pandas()
   
   File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:5002, in pyarrow.lib.Table._to_pandas()
   
   File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/pandas_compat.py:784, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper)
       781 columns = _deserialize_column_index(table, all_columns, column_indexes)
       783 column_names = table.column_names
   --> 784 result = pa.lib.table_to_blocks(options, table, categories,
       785                                 list(ext_columns_dtypes.keys()))
       786 if _pandas_api.is_ge_v3():
       787     from pandas.api.internals import create_dataframe_from_blocks
   
   File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/table.pxi:3941, in pyarrow.lib.table_to_blocks()
   
   File ~/miniforge3/envs/localtesting/lib/python3.9/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
   
   ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483657
   ```
   
   
   Tested on macOS Sonoma 14.5; the same errors also occur on Linux servers.
   
   It seems from_dataframe avoids the error by using a 'large_string' datatype. However, we find from_dataframe performs significantly worse than from_pandas in most cases and would therefore like to avoid it. Additionally, the large_string datatype seems to be lost when the table is reloaded from Parquet.
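   
   One direction we have not verified (just a sketch) would be to request the large types explicitly when converting with from_pandas, e.g. by passing a schema that declares the dictionary values as large_string. Whether this actually sidesteps the ChunkedArray TypeError in DictionaryArray.from_arrays is an open question:
   
   ```python
   # Untested sketch: explicitly request large_string dictionary values when
   # converting with from_pandas (field name taken from the example above).
   # It is only an assumption that this avoids the ChunkedArray TypeError.
   schema = pa.schema([
       pa.field("float_gran", pa.dictionary(pa.int32(), pa.large_string()))
   ])
   table_big = pa.Table.from_pandas(df, schema=schema)
   ```
   
   Similarly, casting the reloaded table back to such a schema before calling to_pandas() might restore the large_string type lost on reload, but we have not tested that either.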
   
   Is there already a way to reliably avoid the TypeError and ArrowCapacityError in the optimized pandas conversion methods, and is this a bug that could be fixed in future versions?
   
   
   ### Component(s)
   
   Python

