xingyu-long opened a new issue, #46760:
URL: https://github.com/apache/arrow/issues/46760

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   @richardliaw mentioned the limitation they found for join with two dataset 
via ray data.
   
   here is the script with ray data to reproduce the issue
   
   ```python3
   In [1]: import ray
   
   In [2]: import numpy as np
   
   In [3]: data = [{"x": np.random.randn(3, 3), "y": 1} for _ in range(10)]
   
   In [4]: ds = ray.data.from_items(data)
   
   In [5]: ds.schema()
   Out[5]:
   Column  Type
   ------  ----
   x       numpy.ndarray(shape=(3, 3), dtype=double)
   y       int64
   
   In [6]: data2 = [{"z":2, "y": 1} for _ in range(10)]
   
   In [7]: ds2 = ray.data.from_items(data2)
   
   In [8]: ds2.schema()
   Out[8]:
   Column  Type
   ------  ----
   z       int64
   y       int64
   
   
   In [9]: ds3 = ds.join(ds2, join_type="inner", num_partitions=2, on=("y",))
   
   In [10]: ds3.take(1)
   
   
   RayTaskError(ArrowInvalid): ray::HashShuffleAggregator.finalize() 
(pid=47755, ip=127.0.0.1, actor_id=beb9b938e32e45e02607910301000000, 
repr=<ray.data._internal.execution.operators.hash_shuffle.HashShuffleAggregator 
object at 0x105c30ce0>)
     File 
"/Users/xingyulong/Desktop/github/ray/.venv/lib/python3.12/site-packages/ray/data/_internal/execution/operators/hash_shuffle.py",
 line 1101, in finalize
       result = self._agg.finalize(partition_id)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/Users/xingyulong/Desktop/github/ray/.venv/lib/python3.12/site-packages/ray/data/_internal/execution/operators/join.py",
 line 109, in finalize
       joined = left_seq_partition.join(
                ^^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/table.pxi", line 4849, in pyarrow.lib.Table.join
     File 
"/Users/xingyulong/Desktop/github/ray/.venv/lib/python3.12/site-packages/pyarrow/acero.py",
 line 246, in _perform_join
       result_table = decl.to_table(use_threads=use_threads)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_acero.pyx", line 511, in pyarrow._acero.Declaration.to_table
     File "pyarrow/error.pxi", line 154, in 
pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Data type 
extension<ray.data.arrow_tensor_v2<ArrowTensorTypeV2>> is not supported in join 
non-key field x
   ```
   
   and looked at the pyarrow error,  it seems we hit check here 
https://github.com/apache/arrow/blob/fbed20a9a23f23dc256539179a90ab465d1b1c2e/cpp/src/arrow/acero/hash_join_node.cc#L49-L58
 and corresponding upper caller 
https://github.com/apache/arrow/blob/fbed20a9a23f23dc256539179a90ab465d1b1c2e/cpp/src/arrow/acero/hash_join_node.cc#L238-L241
   
   
   It may appear to ray data issue, however, we use pyarrow for underlying 
join, and was wondering if this is expected? i.e., pyarrow don't support 
tensor/lists for join?
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to