sharon92 opened a new issue, #46020:
URL: https://github.com/apache/arrow/issues/46020

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I ran into what looks like a bug in pyarrow after spending days trying to track down the problem in my own code.
   
   If I create a table with a large number of values and then group the values by key, the resulting keys are not the same in every run.
   
   Some runs produce a different result for the same input.
   
   ```
   import numpy as np
   import pyarrow as pa
   
   vals = np.random.rand(10000000)
   keys = (vals*100).astype(int)
   
   def compare(new, old):
       if old is None:
           return
       if not np.array_equal(new, old):
           print("Keys are not same as the last run!")
   
   keys_old = None
   for i in range(100):
       table = pa.table(
           [pa.array(vals), pa.array(keys)],
           names=["vals", "keys"],
       )
       
       aggregate = table.group_by("keys").aggregate([("vals", "sum")])
       keys_new = aggregate["keys"].to_numpy()
       compare(keys_new, keys_old)
       keys_old = keys_new
       print(i)
   ```
      
   Here is the output:
   
   ```
   0
   1
   2
   3
   4
   5
   6
   7
   8
   9
   10
   11
   12
   13
   14
   15
   16
   Keys are not same as the last run!
   17
   Keys are not same as the last run!
   18
   19
   20
   21
   22
   23
   24
   25
   26
   27
   28
   Keys are not same as the last run!
   29
   Keys are not same as the last run!
   30
   31
   32
   33
   34
   35
   36
   37
   38
   39
   40
   41
   42
   43
   44
   45
   46
   47
   48
   49
   50
   51
   52
   Keys are not same as the last run!
   53
   54
   Keys are not same as the last run!
   55
   56
   57
   58
   59
   60
   Keys are not same as the last run!
   61
   Keys are not same as the last run!
   62
   63
   64
   65
   66
   67
   68
   69
   70
   71
   72
   73
   74
   75
   76
   77
   78
   79
   80
   81
   82
   83
   84
   85
   86
   87
   88
   89
   90
   91
   92
   93
   94
   95
   96
   97
   98
   99
   ```
   
   Using the latest version:
   ```
   pa.__version__
   Out[264]: '19.0.1'
   ```
   
   Does a data type need to be defined somewhere in the table? Any help or a bug fix would be much appreciated. Thank you!
   
   ### Component(s)
   
   Python


