alanhdu opened a new issue, #44340:
URL: https://github.com/apache/arrow/issues/44340
### Describe the bug, including details regarding any error messages, version, and platform.
I have a table with lots of strings that I would like to export to pandas. The following code reproduces the error:
```python
import numpy as np
import pyarrow as pa
SIZE = 1024
N = 2 * 1024 * 1024
buffer = np.random.bytes(N * SIZE)
table = pa.Table.from_pydict({
    "row": [buffer[i * SIZE: (i + 1) * SIZE] for i in range(N)]
})
df = table.to_pandas(strings_to_categorical=True)
```
This is currently failing with the error:
```
Traceback (most recent call last):
  File "/home/alandu/workspace/scratch/repro.py", line 13, in <module>
    df = table.to_pandas(strings_to_categorical=True)
  File "pyarrow/array.pxi", line 885, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 5002, in pyarrow.lib.Table._to_pandas
  File "/home/alandu/micromamba/envs/test/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 784, in table_to_dataframe
    result = pa.lib.table_to_blocks(options, table, categories,
  File "pyarrow/table.pxi", line 3941, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483648
```
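For reference, the numbers in the error line up with the repro parameters: `N * SIZE` is exactly 2 GiB, just over the reported capacity of 2147483646 bytes. A quick check (nothing beyond arithmetic on the values above):

```python
SIZE = 1024
N = 2 * 1024 * 1024

total_bytes = N * SIZE
print(total_bytes)  # 2147483648 == 2**31, the "have" value in the error
print(2**31 - 2)    # 2147483646, the capacity limit in the error
```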
This is with Python 3.10 and PyArrow 17.0 on Linux (installed via conda-forge).
This *only* seems to happen when I set `strings_to_categorical=True` -- if
that is `False`, then I can export this to a dataframe without issues.
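A minimal sketch of the working path described above, assuming the same `table` from the repro; the flag is the only change relative to the failing call:

```python
# Same table as in the repro; leaving strings_to_categorical at its
# default (False) converts without hitting the capacity error.
df = table.to_pandas(strings_to_categorical=False)
print(df.shape)  # (2097152, 1)
```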
### Component(s)
Python