AlenkaF opened a new issue, #34165:
URL: https://github.com/apache/arrow/issues/34165
### Describe the bug, including details regarding any error messages, version, and platform.
While working on the extension type for tensors in PyArrow, I came across a
behaviour in the conversion to pandas that could be improved.
Creating an extension array (a fixed shape tensor in this case) and converting
it to pandas works well:
```python
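>>> import pyarrow as pa
>>> # tensor_type is the tensor extension type being worked on (definition not shown here)
>>> arr = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
>>> storage = pa.array(arr, pa.list_(pa.int32(), 4))
>>> tensor = pa.ExtensionArray.from_storage(tensor_type, storage)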
>>> tensor.to_pandas()
0 [1, 2, 3, 4]
1 [10, 20, 30, 40]
2 [100, 200, 300, 400]
dtype: object
```
But creating a table with an extension array and then converting it to
pandas fails:
```python
>>> data = [
...     pa.array([1, 2, 3]),
...     pa.array(['foo', 'bar', None]),
...     pa.array([True, None, True]),
...     tensor
... ]
>>> my_schema = pa.schema([('f0', pa.int8()),
...                        ('f1', pa.string()),
...                        ('f2', pa.bool_()),
...                        ('tensors_int', tensor_type)])
>>> table = pa.Table.from_arrays(data, schema=my_schema)
>>> table.to_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 830, in pyarrow.lib._PandasConvertible.to_pandas
    return self._to_pandas(options, categories=categories,
  File "pyarrow/table.pxi", line 4004, in pyarrow.lib.Table._to_pandas
    mgr = table_to_blockmanager(
  File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 820, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 1171, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 1171, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/Users/alenkafrim/repos/arrow/python/pyarrow/pandas_compat.py", line 776, in _reconstruct_block
    pandas_dtype = extension_columns[name]
KeyError: 'tensors_int'
```
The failure occurs because the extension type in this example does not
implement the `to_pandas_dtype` method. As a result, `_get_extension_dtypes`
never adds the name of the extension-typed column to `ext_columns`:
https://github.com/apache/arrow/blob/0368e410be4dac30eada13d307b415165aedc6a7/python/pyarrow/pandas_compat.py#L870-L879
When `to_pandas_dtype` is not implemented, it would be better for the
conversion to fall back to converting the storage array at
https://github.com/apache/arrow/blob/0368e410be4dac30eada13d307b415165aedc6a7/python/pyarrow/pandas_compat.py#L776
similar to what is done in
https://github.com/apache/arrow/blob/925cbd81427ae02ce897c406a264d53c8813b920/python/pyarrow/array.pxi#L2888-L2889
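To illustrate the idea, here is a minimal toy model of the failing lookup and of the suggested fallback. It is not PyArrow code: `ToyExtensionType`, `ToyColumn`, `collect_extension_dtypes`, `reconstruct_strict` and `reconstruct_with_fallback` are hypothetical stand-ins, and the unconditional dictionary lookup is only inferred from the traceback above.

```python
# Toy model only: none of these names are pyarrow APIs.

class ToyExtensionType:
    """Stand-in for an extension type that does not implement to_pandas_dtype."""

    def to_pandas_dtype(self):
        raise NotImplementedError


class ToyColumn:
    """Stand-in for a table column backed by plain Python storage values."""

    def __init__(self, name, typ, storage):
        self.name = name
        self.type = typ
        self.storage = storage


def collect_extension_dtypes(columns):
    """Record a pandas dtype per column, skipping types that provide none."""
    ext_columns = {}
    for col in columns:
        try:
            ext_columns[col.name] = col.type.to_pandas_dtype()
        except NotImplementedError:
            pass  # the column name is never recorded
    return ext_columns


def reconstruct_strict(col, ext_columns):
    """Unconditional lookup: raises KeyError for an unrecorded column."""
    return ext_columns[col.name]


def reconstruct_with_fallback(col, ext_columns):
    """Suggested behaviour: use the storage values when no dtype is known."""
    if col.name in ext_columns:
        return ("extension", ext_columns[col.name])
    return ("storage", list(col.storage))


col = ToyColumn("tensors_int", ToyExtensionType(), [[1, 2, 3, 4]])
ext_columns = collect_extension_dtypes([col])
converted = reconstruct_with_fallback(col, ext_columns)
```

In this model `reconstruct_strict(col, ext_columns)` raises `KeyError: 'tensors_int'`, while the fallback variant returns the storage values instead. The real change would have to convert the Arrow storage array rather than a Python list.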
### Component(s)
Python