[I] [C++][Acero] Poor aggregate performance when there is a large number of batches on the build side [arrow]

via GitHub Tue, 18 Mar 2025 01:17:52 -0700


uchenily opened a new issue, #45847:
URL: https://github.com/apache/arrow/issues/45847


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I am running a performance test on a plan as described below. This execution 
plan is fairly straightforward, involving only two source nodes, one hash join 
node, and one aggregation node.
   
   ```ast
            aggr
              +
            join
        +-----+------+
     source_0     source_1
     (probe)      (build)
   ```
   
   However, I have noticed that the performance is far below my expectations. 
So, I made the following table for further analysis.
   
   In this table, the columns give the number of batches generated on the 
build side, and the rows give the number of batches produced on the probe 
side. Each batch has a size of `1<<15`.
   
   
   | time (in seconds) | build 1 | build 2 | build 4 | build 16 | build 32 | build 64 |
   | ----------------- | ------- | ------- | ------- | -------- | -------- | -------- |
   | probe 1           | 0.03    | 0.04    | 0.06    | 0.03     | 7.6      | 269.7    |
   | probe 2           | 0.04    | 0.04    | 0.06    | 7.7      | 59.1     | 176.0    |
   | probe 4           | 0.05    | 0.06    | 0.06    | 7.7      | 56.1     | 229.9    |
   | probe 16          | 0.11    | 0.12    | 0.12    | 3.2      | 32.2     | 145.8    |
   | probe 32          | 0.19    | 0.19    | 0.20    | 3.1      | 32.2     | 107.5    |
   | probe 64          | 0.36    | 0.36    | 0.36    | 3.4      | 44.9     | 134.9    |
   | probe 256         | 1.33    | 1.33    | 1.34    | 5.5      | 30.7     | 197.3    |
   | probe 1024        | 5.2     | 5.3     | 5.2     | 5.3      | 17.7     | 50.9     |
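   
   To put the batch counts in perspective, the small sketch below converts a 
batch count into a total element count. It assumes that "batch size `1<<15`" 
means 32768 rows per batch; `kRowsPerBatch` and `TotalRows` are illustrative 
names, not part of Arrow.
   
   ```cpp
   #include <cassert>
   #include <cstdint>

   // Assumption: each batch holds 1 << 15 = 32768 rows, as stated in the report.
   constexpr std::int64_t kRowsPerBatch = std::int64_t{1} << 15;

   // Total rows fed into one side of the join for a given number of batches.
   inline std::int64_t TotalRows(std::int64_t num_batches) {
     assert(num_batches >= 0);  // a negative batch count makes no sense
     return num_batches * kRowsPerBatch;
   }
   ```
   
   Under that assumption, `TotalRows(64)` is 2097152, so even the largest 
build side in the table is only about two million rows.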
   
   
   More information:
   1. I ran these tests on an Intel x86_64 machine with about 100 cores.
   
   2. I have noticed that in scenarios where the execution time exceeds 10 
seconds, CPU utilization is very low, and most of the time only one CPU core 
is being used.
   
   3. I found that the `arrow/compute/row/grouper.cc:ConsumeImpl` method is 
quite time-consuming. (map phase)
   
   4. The process of generating data also consumes a considerable amount of 
time, but in the tests mentioned above, the time spent on data generation does 
not account for a significant portion.
   
   5. The distribution of the data almost certainly affects the execution 
time as well, but I have not verified this.
   
   6. In some scenarios, the execution time is longer even with a smaller 
amount of data, which seems unreasonable. For example, with 64 build batches, 
1 probe batch took 269.7 seconds while 1024 probe batches took 50.9 seconds.
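   
   As a rough way to quantify point 6, the sketch below copies the timings 
from the table and counts, for one build column, how many pairs of probe 
configurations are "inverted" (fewer probe batches took longer than more 
probe batches). `kSeconds` and `CountInversions` are hypothetical names used 
only for this illustration.
   
   ```cpp
   #include <cassert>
   #include <cstddef>

   // Seconds copied from the table above.
   // Rows: probe batches 1, 2, 4, 16, 32, 64, 256, 1024.
   // Columns: build batches 1, 2, 4, 16, 32, 64.
   constexpr std::size_t kNumProbe = 8;
   constexpr std::size_t kNumBuild = 6;
   constexpr double kSeconds[kNumProbe][kNumBuild] = {
       {0.03, 0.04, 0.06, 0.03, 7.6, 269.7},
       {0.04, 0.04, 0.06, 7.7, 59.1, 176.0},
       {0.05, 0.06, 0.06, 7.7, 56.1, 229.9},
       {0.11, 0.12, 0.12, 3.2, 32.2, 145.8},
       {0.19, 0.19, 0.20, 3.1, 32.2, 107.5},
       {0.36, 0.36, 0.36, 3.4, 44.9, 134.9},
       {1.33, 1.33, 1.34, 5.5, 30.7, 197.3},
       {5.2, 5.3, 5.2, 5.3, 17.7, 50.9},
   };

   // Counts pairs (i < j) in one build column where the run with fewer probe
   // batches (row i) was strictly slower than the run with more (row j).
   inline int CountInversions(std::size_t build_col) {
     assert(build_col < kNumBuild);
     int inversions = 0;
     for (std::size_t i = 0; i < kNumProbe; ++i) {
       for (std::size_t j = i + 1; j < kNumProbe; ++j) {
         if (kSeconds[i][build_col] > kSeconds[j][build_col]) {
           ++inversions;
         }
       }
     }
     return inversions;
   }
   ```
   
   Calling `CountInversions(0)` scans the build 1 column, where time grows 
steadily with the probe size, while `CountInversions(5)` scans the build 64 
column, where most pairs are out of order.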
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
