NellyWhads opened a new issue, #44473:
URL: https://github.com/apache/arrow/issues/44473

   ### Describe the enhancement requested
   
   I'm looking for documentation on how to implement an ExtensionArray which 
supports `join` functionality.
   
   Particularly, I'd like to join a table which includes a 
`FixedShapeTensorArray` column with another table.
   
   Here's a simple example which does not work.
   
   ```python
   import numpy as np
   import pyarrow as pa
   
   # First dim is the batch dim
   tensors = np.arange(3 * 10 * 10).reshape((3, 10, 10)).astype(np.uint8)
   tensor_array = pa.FixedShapeTensorArray.from_numpy_ndarray(tensors)
   ids = pa.array([1,2,3], type=pa.uint8())
   table = pa.Table.from_arrays([ids, tensor_array], names=["id", "tensor"])
   print(table.schema)
   
   classes = pa.array(["one", "two", "three"], type=pa.string())
   table_2 = pa.Table.from_arrays([ids, classes], names=["id", "name"])
   print(table_2.schema)
   
   table.join(table_2, keys=["id"], join_type="full outer")
   ```
   
   This raises the following error:
   ```
   ---------------------------------------------------------------------------
   ArrowInvalid                              Traceback (most recent call last)
   Cell In[42], line 1
   ----> 1 table.join(table_2, keys=["id"], join_type="full outer")
   
   File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/table.pxi:5570, in pyarrow.lib.Table.join()
   
   File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/acero.py:247, in _perform_join(join_type, left_operand, left_keys, right_operand, right_keys, left_suffix, right_suffix, use_threads, coalesce_keys, output_type)
       242     projection = Declaration(
       243         "project", ProjectNodeOptions(projections, projected_col_names)
       244     )
       245     decl = Declaration.from_sequence([decl, projection])
   --> 247 result_table = decl.to_table(use_threads=use_threads)
       249 if output_type == Table:
       250     return result_table
   
   File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/_acero.pyx:590, in pyarrow._acero.Declaration.to_table()
   
   File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
   
   File ~/.pyenv/versions/next_gen_data_38/lib/python3.8/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
   
   ArrowInvalid: Data type extension<arrow.fixed_shape_tensor[value_type=uint8, shape=[10,10], permutation=[0,1]]> is not supported in join non-key field tensor
   ```
   
   How can I make this work? The individual tensors I want to store are rather 
small (single-digit dimensions), but the join may lead to list aggregation of a 
few hundred rows.
   
   I've tagged this as a python question because I don't know what level of API 
needs to be adjusted to add this functionality.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
