adrien-grl opened a new issue, #50012:
URL: https://github.com/apache/arrow/issues/50012
### Describe the bug, including details regarding any error messages, version, and platform.
**Bug description**
When `pa.Table.from_pylist` is given a schema with a `pa.ExtensionType` column
whose storage contains a `pa.list_` field, and the cumulative number of values
in that list field across all rows exceeds the int32 maximum (2,147,483,647),
the call fails with:
```
TypeError: Argument 'storage' has incorrect type (expected
pyarrow.lib.Array, got pyarrow.lib.ChunkedArray)
```
The message gives no indication of the actual cause of the failure
(for instance, that it originates from a `pa.list_` field or a `pa.ExtensionType`).
**Environment**
- PyArrow 24.0.0
- Python 3.12, Linux x86_64
**Minimal steps to reproduce**
The code requires roughly 3 GB of RAM.
```python
import numpy as np
import pyarrow as pa


class FooExt(pa.ExtensionType):
    def __init__(self):
        super().__init__(
            pa.struct({"data": pa.list_(pa.uint8())}),
            "foo_img",
        )

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()


pa.register_extension_type(FooExt())
schema = pa.schema({"img": FooExt()})

# 5 rows × 500M values = 2.5B > int32 max
arr = np.zeros(500_000_000, dtype=np.uint8)
rows = [{"img": {"data": arr}} for _ in range(5)]

pa.Table.from_pylist(rows, schema=schema)
# TypeError: Argument 'storage' has incorrect type
# (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray)
```
```
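For reference, the arithmetic behind the overflow can be checked without allocating anything. This is only a sketch of the numbers in the repro; the variable names are illustrative and not part of PyArrow.

```python
# int32 list offsets can address at most 2**31 - 1 child values.
INT32_MAX = 2**31 - 1  # 2_147_483_647

n_rows = 5
values_per_row = 500_000_000

total_values = n_rows * values_per_row
print(total_values)              # 2500000000
print(total_values > INT32_MAX)  # True

# Largest number of such rows whose values still fit in int32 offsets.
max_rows_per_chunk = INT32_MAX // values_per_row
print(max_rows_per_chunk)        # 4
```

So four of these rows would fit in a single `pa.list_` array, and the fifth pushes the cumulative offset past the limit.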
**Expected behavior**
Either:
1. An actionable error that names the column, identifies the int32-offset
cause, and ideally points to workarounds (`pa.large_list`, smaller
batches, or manual chunked construction), or
2. A successful build that returns a `ChunkedArray<ExtensionArray>` whose
chunks each fit in int32 offsets.
### Component(s)
Python