NinaPeng opened a new issue, #36878: URL: https://github.com/apache/arrow/issues/36878
### Describe the enhancement requested

So far we have two use scenarios that may need the intermediate state of aggregate kernels during consumption:

1) A shuffle-free, single-stage distributed query engine. Our data is partitioned and stored across multiple nodes, and we would like to create a query plan with multiple fragments that retrieves the partitioned data from all of these nodes in parallel for better performance. Data shuffling is non-trivial for us to implement, and we are looking for a simpler approach. For an aggregation query, one way to do this is to split the aggregation into two steps: pre-aggregation and finalize/combine. In the pre-aggregation step, the aggregation operator only consumes the data and stores the intermediate results internally. In the `finalize/combine` step, it combines the partitioned intermediate results from all nodes into the final result.

2) A materialized view that stores the intermediate state of an aggregation, so that partially aggregated results are persisted on disk. Reading the materialized view is then faster, since only the `finalize` computation is needed to produce the final results.

For some aggregate kernels such as `avg`, we could use two existing aggregate kernels (sum/count) to maintain such intermediate state manually (see the sketch at the end of this issue), but that requires developers to understand how these kernels are implemented internally, which is probably not easy, and new aggregate kernels may be added in the future. Since Acero's aggregate kernels already maintain this intermediate state internally, I wonder if it is possible to add APIs to the aggregate kernels for retrieving it, which would enable both of the use scenarios above. Thanks.

### Component(s)

C++
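Below is a minimal sketch of the manual workaround mentioned above for `avg`/`mean`, not an official Arrow API: each partition is reduced to a (sum, count) pair using the existing `arrow::compute::Sum` and `arrow::compute::Count` kernels, and a separate step combines the pairs into the final mean. The `MeanPartial`, `PreAggregate`, and `Finalize` names are hypothetical helpers used only for illustration; a real engine would serialize the partials between nodes or persist them in the materialized view.

```cpp
// Sketch of the manual pre-aggregation / finalize workaround for `mean`.
// MeanPartial, PreAggregate, and Finalize are hypothetical helpers; only
// Sum and Count are existing Arrow compute kernels.
#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

// Intermediate state a node would ship or persist instead of raw rows.
struct MeanPartial {
  double sum = 0.0;
  int64_t count = 0;
};

// Pre-aggregation step: reduce one partition to its (sum, count) pair.
arrow::Result<MeanPartial> PreAggregate(const std::shared_ptr<arrow::Array>& partition) {
  ARROW_ASSIGN_OR_RAISE(arrow::Datum sum, cp::Sum(partition));
  ARROW_ASSIGN_OR_RAISE(arrow::Datum count, cp::Count(partition));
  MeanPartial partial;
  partial.sum = std::static_pointer_cast<arrow::DoubleScalar>(sum.scalar())->value;
  partial.count = std::static_pointer_cast<arrow::Int64Scalar>(count.scalar())->value;
  return partial;
}

// Finalize/combine step: merge the partials from all partitions.
double Finalize(const std::vector<MeanPartial>& partials) {
  MeanPartial total;
  for (const auto& p : partials) {
    total.sum += p.sum;
    total.count += p.count;
  }
  return total.count == 0 ? 0.0 : total.sum / static_cast<double>(total.count);
}

int main() {
  // Two partitions standing in for data stored on two different nodes.
  arrow::DoubleBuilder builder;
  std::shared_ptr<arrow::Array> partition1, partition2;
  if (!builder.AppendValues({1.0, 2.0, 3.0}).ok() || !builder.Finish(&partition1).ok() ||
      !builder.AppendValues({4.0, 5.0}).ok() || !builder.Finish(&partition2).ok()) {
    return 1;
  }

  std::vector<MeanPartial> partials;
  partials.push_back(PreAggregate(partition1).ValueOrDie());
  partials.push_back(PreAggregate(partition2).ValueOrDie());
  std::cout << "mean = " << Finalize(partials) << std::endl;  // prints 3
  return 0;
}
```

This works only because the intermediate state of `mean` happens to be expressible with other public kernels; aggregates whose state is not exposed this way (or kernels added in the future) would need the requested API.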
