alanhdu opened a new issue, #44340:
URL: https://github.com/apache/arrow/issues/44340
### Describe the bug, including details regarding any error messages, version, and platform.
I have a table with lots of strings that I would like to export to pandas. The following code reproduces the error:
```python
import numpy as np
import pyarrow as pa
SIZE = 1024
N = 2 * 1024 * 1024
buffer = np.random.bytes(N * SIZE)
table = pa.Table.from_pydict({
    "row": [buffer[i * SIZE: (i + 1) * SIZE] for i in range(N)]
})
df = table.to_pandas(strings_to_categorical=True)
```
This is currently failing with the error:
```
Traceback (most recent call last):
  File "/home/alandu/workspace/scratch/repro.py", line 13, in <module>
    df = table.to_pandas(strings_to_categorical=True)
  File "pyarrow/array.pxi", line 885, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 5002, in pyarrow.lib.Table._to_pandas
  File "/home/alandu/micromamba/envs/test/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 784, in table_to_dataframe
    result = pa.lib.table_to_blocks(options, table, categories,
  File "pyarrow/table.pxi", line 3941, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483648
```
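For reference, the numbers in the error line up with the repro parameters: `N * SIZE` is exactly 2 GiB, just over the reported capacity of 2147483646 bytes. A quick check (nothing beyond arithmetic on the values above):

```python
SIZE = 1024
N = 2 * 1024 * 1024

total_bytes = N * SIZE
print(total_bytes)  # 2147483648 == 2**31, the "have" value in the error
print(2**31 - 2)    # 2147483646, the capacity limit in the error
```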
This is with Python 3.10 and PyArrow 17.0 on Linux (installed via conda-forge).
This *only* seems to happen when I set `strings_to_categorical=True` -- if
that is `False`, then I can export this to a dataframe without issues.
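A minimal sketch of the working path described above, assuming the same `table` from the repro; the flag is the only change relative to the failing call:

```python
# Same table as in the repro; leaving strings_to_categorical at its
# default (False) converts without hitting the capacity error.
df = table.to_pandas(strings_to_categorical=False)
print(df.shape)  # (2097152, 1)
```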
### Component(s)
Python