aboderinsamuel opened a new issue, #50165:
URL: https://github.com/apache/arrow/issues/50165
### Describe the enhancement requested
Tracking / discussion issue spun out of the review on #<THIS_PR> (which
implements `FixedShapeTensorType.to_pandas_dtype`, GH-49907).
Today all canonical extension types (`bool8`, `json`, `uuid`, `opaque`,
`fixed_shape_tensor`, …) inherit `DataType.to_pandas_dtype`, which raises
`NotImplementedError`. As a result, `to_pandas` / `Table.to_pandas` fall back
to converting the storage (often an object/numpy column), and
`Table.to_pandas(split_blocks=True)` raises `KeyError` for these columns.
#<THIS_PR> returns `pandas.ArrowDtype(self)` from
`FixedShapeTensorType.to_pandas_dtype`. `ArrowDtype` is a pandas
`ExtensionDtype` that implements `__from_arrow__`, so this fixes the error and
yields a faithful, round-trippable extension column on pandas >= 2.1. This
issue tracks extending that approach and the open questions raised in review.
### Open questions
1. **Which canonical extension types should implement `to_pandas_dtype`, and
   to what?** `pd.ArrowDtype(self)` is a sensible generic default, but some
   types may map more naturally to a native pandas dtype (e.g. `bool8` → a
   boolean dtype). Decide per-type vs. a shared default on
   `BaseExtensionType`. Note that a `BaseExtensionType` default would also
   change behavior for *user-defined* extension types, which relates to the
   `ExtensionScalar.as_py()` fallback in #33134.
2. **Implications for `to_pandas` / `Table.to_pandas`.** Returning a dtype
   with `__from_arrow__` changes conversion from the storage/object fallback
   to a faithful extension-typed column. Pros: round-trips preserve the type,
   and `split_blocks=True` works. Cons: it is a user-facing behavior change
   (changelog needed) and is gated to pandas >= 2.1 (reliable `ArrowDtype`
   extension blocks, GH-35821). `types_mapper` continues to take precedence.
3. **Docstring cleanup.** `BaseExtensionType` and its subclasses inherit
   `to_pandas_dtype` (and related methods) from `DataType` with no mention of
   extension-specific behavior; document this.
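
The resolution order discussed in question 2 can be sketched with a toy, stdlib-only model. This is not pyarrow's conversion code: `StorageOnlyType`, `MappedType`, `resolve_dtype`, and the string placeholders standing in for dtypes are all hypothetical names made up for illustration. It only mirrors the order described above: a `types_mapper` result wins, then the type's own `to_pandas_dtype`, and a `NotImplementedError` leads to the storage fallback.

```python
class StorageOnlyType:
    """Stand-in for an extension type that inherits the raising default."""

    def to_pandas_dtype(self):
        raise NotImplementedError("storage_only")


class MappedType:
    """Stand-in for an extension type that implements to_pandas_dtype."""

    def to_pandas_dtype(self):
        # A real implementation would return a pandas ExtensionDtype here.
        return "ext-dtype"


def resolve_dtype(arrow_type, types_mapper=None):
    """Return (source, dtype), where source names the rule that applied."""
    if types_mapper is not None:
        mapped = types_mapper(arrow_type)
        # A mapper returning None means "use the default conversion".
        if mapped is not None:
            return "types_mapper", mapped
    try:
        return "to_pandas_dtype", arrow_type.to_pandas_dtype()
    except NotImplementedError:
        return "storage_fallback", None


print(resolve_dtype(StorageOnlyType()))
print(resolve_dtype(MappedType()))
print(resolve_dtype(MappedType(), types_mapper=lambda t: "override"))
```

In this model, implementing `to_pandas_dtype` on a type only changes the outcome for callers that do not already supply a `types_mapper` entry for it, which is why the change is user-facing mainly for default `to_pandas` calls.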
### Proposed direction
Keep #<THIS_PR> scoped to `fixed_shape_tensor`; handle the rest as small
follow-up PRs, each with its own changelog note:
- [ ] `bool8`
- [ ] `uuid`
- [ ] `json`
- [ ] `opaque`
- [ ] Docstring pass over `BaseExtensionType` + subclasses documenting
  `to_pandas_dtype` / `to_pandas` behavior
cc @AlenkaF @jorisvandenbossche
### Component(s)
Python