aboderinsamuel opened a new issue, #50165:
URL: https://github.com/apache/arrow/issues/50165

   ### Describe the enhancement requested
   
   Tracking / discussion issue spun out of the review on #<THIS_PR> (which
   implements `FixedShapeTensorType.to_pandas_dtype`, GH-49907).
   
   Today all canonical extension types (`bool8`, `json`, `uuid`, `opaque`,
   `fixed_shape_tensor`, …) inherit `DataType.to_pandas_dtype`, which raises
   `NotImplementedError`. As a result `to_pandas` / `Table.to_pandas` fall back to
   converting the storage (often an object/numpy column), and
   `Table.to_pandas(split_blocks=True)` raises `KeyError` for these columns.
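
A minimal sketch for checking which types currently hit the inherited behavior described above. `probe_to_pandas_dtype` is a hypothetical helper name, not part of pyarrow; it only assumes the object exposes a `to_pandas_dtype()` method that may raise `NotImplementedError`.

```python
def probe_to_pandas_dtype(arrow_type):
    """Report whether a type-like object implements ``to_pandas_dtype``.

    Returns ("ok", dtype) when the call succeeds and
    ("not_implemented", None) when it raises NotImplementedError.
    Hypothetical helper for illustration only.
    """
    try:
        return ("ok", arrow_type.to_pandas_dtype())
    except NotImplementedError:
        # The inherited DataType.to_pandas_dtype path described in this issue.
        return ("not_implemented", None)
```

With pyarrow installed, passing `pa.fixed_shape_tensor(pa.float32(), (2, 2))` should report `not_implemented` on releases that predate #<THIS_PR>, going by the description above; a primitive type such as `pa.int64()` should report `ok`.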
   
   #<THIS_PR> returns `pandas.ArrowDtype(self)` from
   `FixedShapeTensorType.to_pandas_dtype` (a pandas `ExtensionDtype` implementing
   `__from_arrow__`), which fixes the error and yields a faithful, round-trippable
   extension column on pandas >= 2.1. This issue tracks extending that approach and
   the open questions raised in review.
   
   ### Open questions
   
   1. **Which canonical extension types should implement `to_pandas_dtype`, and to
      what?** `pd.ArrowDtype(self)` is a sensible generic default, but some types may
      map more naturally to a native pandas dtype (e.g. `bool8` → a boolean dtype).
      Decide per-type vs. a shared default on `BaseExtensionType`. Note that a
      `BaseExtensionType` default would also change behavior for *user-defined*
      extension types, which relates to the `ExtensionScalar.as_py()` fallback in
      #33134.
   
   2. **Implications for `to_pandas` / `Table.to_pandas`.** Returning a dtype with
      `__from_arrow__` changes conversion from the storage/object fallback to a
      faithful extension-typed column. Pros: round-trips preserve the type,
      `split_blocks=True` works. Cons: user-facing behavior change (changelog
      needed); gated to pandas >= 2.1 (reliable `ArrowDtype` extension blocks,
      GH-35821). `types_mapper` continues to take precedence.
   
   3. **Docstring cleanup.** `BaseExtensionType` and its subclasses inherit
      `to_pandas_dtype` (and related methods) from `DataType` with no mention of
      extension-specific behavior; document this.
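
To make the precedence in questions 1 and 2 concrete, here is a dependency-free toy model of the resolution order as described above: `types_mapper` first, then the type's own `to_pandas_dtype`, then the storage fallback. All names (`FakeArrowType`, `FakeExtensionDtype`, `resolve_conversion`) are hypothetical stand-ins, and this is not how pyarrow's conversion code is actually structured.

```python
class FakeExtensionDtype:
    """Hypothetical stand-in for a pandas ExtensionDtype with __from_arrow__."""

    def __init__(self, name):
        self.name = name

    def __from_arrow__(self, array):
        # A real dtype would build an extension array; the model just tags the data.
        return ("extension", self.name, list(array))


class FakeArrowType:
    """Hypothetical stand-in for an Arrow extension type."""

    def __init__(self, name, pandas_dtype=None):
        self.name = name
        self._pandas_dtype = pandas_dtype

    def to_pandas_dtype(self):
        # Mirrors the inherited behavior: raise unless a dtype was provided.
        if self._pandas_dtype is None:
            raise NotImplementedError(self.name)
        return self._pandas_dtype


def resolve_conversion(arrow_type, types_mapper=None):
    """Return (source, dtype) describing which path the model would take."""
    # 1. A user-supplied types_mapper wins when it returns a dtype.
    if types_mapper is not None:
        mapped = types_mapper(arrow_type)
        if mapped is not None:
            return ("types_mapper", mapped)
    # 2. Otherwise ask the type itself.
    try:
        return ("to_pandas_dtype", arrow_type.to_pandas_dtype())
    except NotImplementedError:
        # 3. No dtype available: convert the underlying storage instead.
        return ("storage_fallback", None)
```

In this model, a shared default on the base class would amount to every `FakeArrowType` being constructed with a dtype, which is why user-defined extension types would also move from the third branch to the second.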
   
   ### Proposed direction
   
   Keep #<THIS_PR> scoped to `fixed_shape_tensor`; handle the rest as small
   follow-up PRs, each with its own changelog note:
   
   - [ ] `bool8`
   - [ ] `uuid`
   - [ ] `json`
   - [ ] `opaque`
   - [ ] Docstring pass over `BaseExtensionType` + subclasses documenting
         `to_pandas_dtype` / `to_pandas` behavior
   
   cc @AlenkaF @jorisvandenbossche
   
   ### Component(s)
   
   Python

