roosephu opened a new issue, #45686:
URL: https://github.com/apache/arrow/issues/45686

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   To reproduce: 
   
   ```python
   import pyarrow
   import numpy as np
   
   print(f"{pyarrow.__version__ = }")
   
   N = 2**30 // 4
   
   data = {
       "id": [4, 3, 2, 1],
       "data": [np.zeros(N, dtype=np.int64) + i for i in range(4)],
   }
   table = pyarrow.Table.from_pydict(data)
   
   table2 = table.sort_by("id")
   print(table2)
   ```
   
   Actual output (the `data` rows should come out in the order 3, 2, 1, 0, but they come out as 1, 0, 1, 0):
   
   ```
   pyarrow.__version__ = '19.0.1'
   pyarrow.Table
   id: int64
   data: list<item: int64>
     child 0, item: int64
   ----
   id: [[1,2,3,4]]
   data: [[[1,1,1,1,1,...,1,1,1,1,1],[0,0,0,0,0,...,0,0,0,0,0],[1,1,1,1,1,...,1,1,1,1,1],[0,0,0,0,0,...,0,0,0,0,0]]]
   ```
   
   Changing dtype to `np.int8` gives the correct output:
   
   ```
   pyarrow.__version__ = '19.0.1'
   pyarrow.Table
   id: int64
   data: list<item: int8>
     child 0, item: int8
   ----
   id: [[1,2,3,4]]
   data: [[[3,3,3,3,3,...,3,3,3,3,3],[2,2,2,2,2,...,2,2,2,2,2],[1,1,1,1,1,...,1,1,1,1,1],[0,0,0,0,0,...,0,0,0,0,0]]]
   ```
   
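   The only difference between the two runs is the element width (8 bytes per item for `np.int64`, 1 byte for `np.int8`), so my guess is that an offset computation overflows during sorting, although I don't know how pyarrow works internally.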
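
A quick arithmetic check of that guess, as a sketch only: it assumes (hypothetically, this is not confirmed against the pyarrow source) that some byte offset is truncated to 32 bits while rows are gathered in sorted order. The helper name `aliased_rows` and the `wrap_bits` parameter are made up for illustration. Under that assumption, each `int64` row occupies 2**31 bytes, so the byte offsets of rows 2 and 3 wrap around onto rows 0 and 1, which matches the output above; with `int8` the total is only 2**30 bytes and nothing wraps.

```python
N = 2**30 // 4  # items per row, as in the reproduction script


def aliased_rows(itemsize, wrap_bits=32):
    """Return the row each sorted position would read from if byte
    offsets wrapped at 2**wrap_bits (an assumed failure mode)."""
    order = [3, 2, 1, 0]          # source rows after sorting id = [4, 3, 2, 1]
    row_bytes = N * itemsize      # size of one row's values in bytes
    return [((row * row_bytes) % 2**wrap_bits) // row_bytes for row in order]


print(aliased_rows(8))  # int64 items
print(aliased_rows(1))  # int8 items
```

The `int64` case yields rows 1, 0, 1, 0 and the `int8` case yields rows 3, 2, 1, 0, consistent with the two outputs shown above. This only shows that a 32-bit wrap would explain the symptom; it does not identify where in pyarrow it would happen.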
   
   ### Component(s)
   
   Python

