sharon92 opened a new issue, #46020:
URL: https://github.com/apache/arrow/issues/46020
### Describe the bug, including details regarding any error messages, version, and platform.
I have encountered what looks like a bug in pyarrow, after spending days
tracking down the problem in my own code.
If I create a table with a large number of values and then group the values
by key, the resulting keys are not the same on every run.
Some runs produce a different result from the same input.
```
import numpy as np
import pyarrow as pa

# 10 million random values, bucketed into integer keys 0-99
vals = np.random.rand(10000000)
keys = (vals * 100).astype(int)


def compare(new, old):
    # Report when the key column differs from the previous run
    if old is None:
        return
    if not np.array_equal(new, old):
        print("Keys are not same as the last run!")


keys_old = None
for i in range(100):
    table = pa.table(
        [pa.array(vals), pa.array(keys)],
        names=["vals", "keys"],
    )
    aggregate = table.group_by("keys").aggregate([("vals", "sum")])
    keys_new = aggregate["keys"].to_numpy()
    compare(keys_new, keys_old)
    keys_old = keys_new
    print(i)
```
Here is the output:
```
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Keys are not same as the last run!
17
Keys are not same as the last run!
18
19
20
21
22
23
24
25
26
27
28
Keys are not same as the last run!
29
Keys are not same as the last run!
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
Keys are not same as the last run!
53
54
Keys are not same as the last run!
55
56
57
58
59
60
Keys are not same as the last run!
61
Keys are not same as the last run!
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
```
Using the latest version:
```
pa.__version__
Out[264]: '19.0.1'
```
Does a data type need to be defined somewhere in the table? Any help or bug
fix would be much appreciated. Thank you!
### Component(s)
Python