sharon92 opened a new issue, #46020: URL: https://github.com/apache/arrow/issues/46020
### Describe the bug, including details regarding any error messages, version, and platform.

I encountered a bug in pyarrow after spending days tracking down the problem in my code. If I create a table with a large number of values and then group the values by key, the resulting keys are not the same on every run. Some runs produce a different result from the same input.

```
import numpy as np
import pyarrow as pa

# 10 million random values, bucketed into integer keys 0..99
vals = np.random.rand(10000000)
keys = (vals*100).astype(int)

def compare(new, old):
    # Nothing to compare against on the first iteration
    if old is None:
        return
    if not np.array_equal(new, old):
        print("Keys are not same as the last run!")

keys_old = None
for i in range(100):
    # Rebuild the same table from the same input on every iteration
    table = pa.table(
        [pa.array(vals), pa.array(keys)],
        names=["vals", "keys"],
    )
    aggregate = table.group_by("keys").aggregate([("vals", "sum")])
    keys_new = aggregate["keys"].to_numpy()
    compare(keys_new, keys_old)
    keys_old = keys_new
    print(i)
```

Here is the output:

```
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Keys are not same as the last run!
17
Keys are not same as the last run!
18
19
20
21
22
23
24
25
26
27
28
Keys are not same as the last run!
29
Keys are not same as the last run!
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
Keys are not same as the last run!
53
54
Keys are not same as the last run!
55
56
57
58
59
60
Keys are not same as the last run!
61
Keys are not same as the last run!
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
```

I am using the latest version:

```
pa.__version__
Out[264]: '19.0.1'
```

Does a data type need to be defined somewhere in the table? Any help or a bug fix would be much appreciated. Thank you!

### Component(s)

Python
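One way to narrow the report down is to check whether the runs differ only in the order of the keys or in the keys themselves, since `np.array_equal` flags both. The helper below is a minimal sketch, not part of the original reproduction; the name `classify_difference` is hypothetical, and it uses only plain Python so it accepts lists or NumPy arrays.

```python
def classify_difference(new, old):
    """Describe how two key sequences differ (hypothetical helper)."""
    new_list, old_list = list(new), list(old)
    # Same values in the same positions
    if new_list == old_list:
        return "identical"
    # Same values once order is ignored: only the row order changed
    if sorted(new_list) == sorted(old_list):
        return "same keys, different order"
    # The set of keys itself changed between runs
    return "different keys"


if __name__ == "__main__":
    print(classify_difference([3, 1, 2], [1, 2, 3]))
```

Calling `classify_difference(keys_new, keys_old)` in place of the `np.array_equal` check would show which case applies. If only the order changes, that would be consistent with the `use_threads` parameter documented for `Table.group_by`: it defaults to `True`, and the documentation states that no stable output ordering is guaranteed in that mode. I have not verified this against the run above, so treat it as an assumption; trying `table.group_by("keys", use_threads=False)` or sorting the aggregated table by `keys` before comparing may be worth a test.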