pitrou opened a new issue, #45741:
URL: https://github.com/apache/arrow/issues/45741

   ### Describe the enhancement requested
   
   Running some simple benchmarks from Python, I was a bit surprised by the 
performance of group-by aggregations:
   * 10000 groups:
   ```pycon
   >>> import pyarrow as pa
   >>> import pyarrow.compute as pc
   >>> n = 10000
   >>> a = pa.table({'group': list(range(n))*2, 'key': ['h']*n+['w']*n, 'value': range(n*2)})
   >>> %timeit a.group_by('group', use_threads=False).aggregate([('value', 'sum')])
   496 μs ± 439 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   >>> %timeit a.group_by('group', use_threads=False).aggregate([(('key', 'value'), 'pivot_wider', pc.PivotWiderOptions(key_names=('h', 'w')))])
   708 μs ± 1.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   ```
   * 100000 groups:
   ```pycon
   >>> n = 100000
   >>> a = pa.table({'group': list(range(n))*2, 'key': ['h']*n+['w']*n, 'value': range(n*2)})
   >>> %timeit a.group_by('group', use_threads=False).aggregate([('value', 'sum')])
   5.93 ms ± 11.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   >>> %timeit a.group_by('group', use_threads=False).aggregate([(('key', 'value'), 'pivot_wider', pc.PivotWiderOptions(key_names=('h', 'w')))])
   8.23 ms ± 28.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   ```
   
   I was initially expecting `pivot_wider` to be much slower than `sum`, both 
because it does a secondary grouping using a naive `std::unordered_map`, and 
because it does a row-to-column transposition of grouped values. But 
`pivot_wider` appears to be only about 40% slower than a simple `sum` in both runs.
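
For readers unfamiliar with the aggregation, here is a minimal pure-Python sketch of the two extra steps described above: the secondary grouping on the pivot key and the row-to-column transposition. It is a conceptual model only, not Arrow's C++ implementation. The function name `grouped_pivot_wider` is hypothetical, and skipping unexpected keys and raising on duplicates are assumptions made for the sketch.

```python
from collections import defaultdict


def grouped_pivot_wider(groups, keys, values, key_names):
    """Conceptual model of a grouped pivot; not Arrow's actual algorithm."""
    # Secondary grouping: map each expected pivot key to an output column index.
    key_to_col = {name: i for i, name in enumerate(key_names)}
    # One output row per group, one slot per key name (None = no value seen).
    out = defaultdict(lambda: [None] * len(key_names))
    for g, k, v in zip(groups, keys, values):
        col = key_to_col.get(k)
        if col is None:
            continue  # assumption: keys outside key_names are skipped
        row = out[g]
        if row[col] is not None:
            # assumption: a second value for the same (group, key) is an error
            raise ValueError(f"duplicate value for group {g!r}, key {k!r}")
        # Row-to-column transposition: the value lands in its key's column.
        row[col] = v
    return dict(out)


# Same table shape as in the benchmarks, shrunk to n = 3.
n = 3
result = grouped_pivot_wider(
    groups=list(range(n)) * 2,
    keys=["h"] * n + ["w"] * n,
    values=list(range(n * 2)),
    key_names=("h", "w"),
)
print(result)  # {0: [0, 3], 1: [1, 4], 2: [2, 5]}
```

Each input row costs one extra key lookup and one scattered write on top of the primary grouping, which is why a large slowdown relative to `sum` seemed plausible.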
   
   In absolute numbers, it seems group-by summing hovers at around 30-40M 
rows/second. Given that we're supposed to use a high-performance hash table 
("swiss table" with AVX2 optimizations) and the group ids above are trivially 
distributed integers, this doesn't seem like a very high number.
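
The rows/second estimate follows directly from the `sum` timings reported above, given that each table holds `n*2` rows. The arithmetic, spelled out as a small sketch (the variable names are illustrative):

```python
# Timings reported above for the single-threaded `sum` aggregation.
runs = {
    10_000: 496e-6,    # n = 10000  -> 496 μs per loop
    100_000: 5.93e-3,  # n = 100000 -> 5.93 ms per loop
}

throughput = {}
for n, seconds in runs.items():
    rows = 2 * n  # each table has n*2 rows
    throughput[n] = rows / seconds / 1e6  # millions of rows per second
    print(f"n={n}: {throughput[n]:.1f}M rows/s")
# n=10000: 40.3M rows/s
# n=100000: 33.7M rows/s
```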
   
   What should be the expectations here? @zanmato1984 
   
   
   
   ### Component(s)
   
   C++

