xingyu-long opened a new issue, #46760: URL: https://github.com/apache/arrow/issues/46760
### Describe the bug, including details regarding any error messages, version, and platform. @richardliaw mentioned the limitation they found for join with two dataset via ray data. here is the script with ray data to reproduce the issue ```python3 In [1]: import ray In [2]: import numpy as np In [3]: data = [{"x": np.random.randn(3, 3), "y": 1} for _ in range(10)] In [4]: ds = ray.data.from_items(data) In [5]: ds.schema() Out[5]: Column Type ------ ---- x numpy.ndarray(shape=(3, 3), dtype=double) y int64 In [6]: data2 = [{"z":2, "y": 1} for _ in range(10)] In [7]: ds2 = ray.data.from_items(data2) In [8]: ds2.schema() Out[8]: Column Type ------ ---- z int64 y int64 In [9]: ds3 = ds.join(ds2, join_type="inner", num_partitions=2, on=("y",)) In [10]: ds3.take(1) RayTaskError(ArrowInvalid): ray::HashShuffleAggregator.finalize() (pid=47755, ip=127.0.0.1, actor_id=beb9b938e32e45e02607910301000000, repr=<ray.data._internal.execution.operators.hash_shuffle.HashShuffleAggregator object at 0x105c30ce0>) File "/Users/xingyulong/Desktop/github/ray/.venv/lib/python3.12/site-packages/ray/data/_internal/execution/operators/hash_shuffle.py", line 1101, in finalize result = self._agg.finalize(partition_id) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/xingyulong/Desktop/github/ray/.venv/lib/python3.12/site-packages/ray/data/_internal/execution/operators/join.py", line 109, in finalize joined = left_seq_partition.join( ^^^^^^^^^^^^^^^^^^^^^^^^ File "pyarrow/table.pxi", line 4849, in pyarrow.lib.Table.join File "/Users/xingyulong/Desktop/github/ray/.venv/lib/python3.12/site-packages/pyarrow/acero.py", line 246, in _perform_join result_table = decl.to_table(use_threads=use_threads) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "pyarrow/_acero.pyx", line 511, in pyarrow._acero.Declaration.to_table File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Data type extension<ray.data.arrow_tensor_v2<ArrowTensorTypeV2>> is not supported in join non-key field x ``` and looked at the pyarrow error, it seems we hit check here https://github.com/apache/arrow/blob/fbed20a9a23f23dc256539179a90ab465d1b1c2e/cpp/src/arrow/acero/hash_join_node.cc#L49-L58 and corresponding upper caller https://github.com/apache/arrow/blob/fbed20a9a23f23dc256539179a90ab465d1b1c2e/cpp/src/arrow/acero/hash_join_node.cc#L238-L241 It may appear to ray data issue, however, we use pyarrow for underlying join, and was wondering if this is expected? i.e., pyarrow don't support tensor/lists for join? ### Component(s) Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org