[I] [pyarrow] `pyarrow.unique` gives garbage results with chunked dictionary arrays. [arrow]

via GitHub Thu, 12 Dec 2024 02:37:43 -0800


Yeshwanth-G opened a new issue, #45010:
URL: https://github.com/apache/arrow/issues/45010


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Consider:
   ```
   In [9]: pyarrow.__version__
   Out[9]: '18.0.0'
   In [10]: a = pa.DictionaryArray.from_arrays(pa.array([63], type='int8'), [f"a{i}" for i in range(64)])
   
   In [11]: b = pa.DictionaryArray.from_arrays(pa.array([64], type='int8'), [f"b{i}" for i in range(65)])
   
   In [12]: c = pa.chunked_array([a, b])
   
   In [13]: pa.compute.unique(c).indices
   Out[13]: 
   <pyarrow.lib.Int8Array object at 0x7f4590491540>
   [
     63,
     -128 # <- bad results.
   ]
   
   In [14]:
   ```
   
   We have a chunked array in which each chunk is a dictionary array with
   `int8` indices. Together the two dictionaries hold 129 distinct values,
   more than `int8` can index. In this case `unique` appears to fit the final
   result into the input index type, which leads to wraparound and garbage
   values in the result.
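
The arithmetic behind the `-128` can be sketched in plain Python. This is only an illustration, not pyarrow's implementation: it assumes the two dictionaries are unified by appending `b`'s values after `a`'s, and it uses `ctypes.c_int8` to stand in for an unchecked `int8` store.

```python
import ctypes

# The same dictionaries as in the report.
dict_a = [f"a{i}" for i in range(64)]
dict_b = [f"b{i}" for i in range(65)]

# Assumption: unification appends b's values after a's (no values overlap).
unified = dict_a + dict_b

# Positions of the two values that `unique` should return.
pos_a = unified.index("a63")  # index 63 of chunk a
pos_b = unified.index("b64")  # index 64 of chunk b, shifted by len(dict_a)

# Storing the positions in a signed 8-bit integer without a range check
# wraps anything above 127 around to a negative value.
wrapped = [ctypes.c_int8(p).value for p in (pos_a, pos_b)]

print(len(unified), pos_b, wrapped)
```

Under that assumption, the second value would need index 128 in the unified dictionary, one past the largest `int8` value, which is consistent with the `-128` shown above.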
   
   Even calling `validate` on the above result does not raise an error.
   ```
   In [10]: pa.compute.unique(c).validate()
   
   In [11]:
   ```
   
   Should `unique` raise an error in such cases, or should it choose an
   output index type large enough to hold the result?
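
For the second option, picking a suitable type could look roughly like the sketch below. The helper `smallest_signed_index_type` is hypothetical and is not part of pyarrow; it only shows the size check that would be needed before writing indices.

```python
def smallest_signed_index_type(num_entries):
    """Return the name of the narrowest signed integer type that can index
    a dictionary with `num_entries` values (hypothetical helper)."""
    largest_index = num_entries - 1
    for name, bits in (("int8", 8), ("int16", 16), ("int32", 32), ("int64", 64)):
        # A signed type with `bits` bits holds values up to 2**(bits - 1) - 1.
        if largest_index <= 2 ** (bits - 1) - 1:
            return name
    raise OverflowError(f"no signed integer type can index {num_entries} entries")


# 128 entries still fit in int8 (largest index 127); 129 entries do not.
print(smallest_signed_index_type(128))
print(smallest_signed_index_type(64 + 65))
```

For the example in this report, the 129 combined values would call for `int16` indices rather than `int8`.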
   
   ### Component(s)
   
   Python


