NinaPeng opened a new issue, #36878:
URL: https://github.com/apache/arrow/issues/36878

   ### Describe the enhancement requested
   
   So far we have two use cases that need access to the intermediate state of 
aggregate kernels while they are consuming input:
   1) A shuffle-free, single-stage distributed query engine. Our data is 
partitioned and stored on multiple nodes, and we would like to create a query 
plan with multiple fragments that retrieve the partitioned data from all of 
these nodes in parallel for better performance. Data shuffling is nontrivial 
for us to implement, so we are looking for a simpler approach. For aggregation 
queries, one option seems to be to split the aggregation into two steps: 
pre-aggregation and `finalize/combine`. In the pre-aggregation step, the 
aggregation operator only consumes the data and stores the intermediate results 
internally. In the `finalize/combine` step, it combines the intermediate 
results from all partitions into the final result.
   
   2) A materialized view that stores the intermediate state of an aggregation. 
Because the partially aggregated results (the intermediate state) are stored in 
the materialized view on disk, reading the view is faster: only the `finalize` 
computation is needed to get the results.
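
   As a rough sketch of the two-step split described in 1), here is what 
pre-aggregation and `finalize/combine` could look like for a mean. The names 
`MeanState`, `Consume`, `Merge` and `Finalize` are hypothetical and written for 
illustration only; they are not Arrow APIs.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical partial state for a mean aggregate; not an Arrow type.
struct MeanState {
  double sum = 0.0;
  int64_t count = 0;
};

// Pre-aggregation: fold one partition's values into its local state.
void Consume(MeanState& state, const std::vector<double>& values) {
  for (double v : values) {
    state.sum += v;
    ++state.count;
  }
}

// Combine: merge another partition's state into this one.
void Merge(MeanState& into, const MeanState& other) {
  into.sum += other.sum;
  into.count += other.count;
}

// Finalize: turn the merged state into the result.
// Returns false when no values were consumed.
bool Finalize(const MeanState& state, double* out) {
  if (state.count == 0) return false;
  *out = state.sum / static_cast<double>(state.count);
  return true;
}
```

   Each node would call `Consume` on its own partition and send only its 
`MeanState` to the coordinator, which calls `Merge` once per node and then 
`Finalize`. No raw rows need to be shuffled.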
   
   For some aggregate kernels, such as `avg`, we could use two existing 
aggregate kernels (sum/count) to maintain this intermediate state manually. 
However, that requires developers to understand how the kernels are implemented 
internally, which is probably not easy, and it would have to be repeated for 
any aggregate kernels added in the future.
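
   To illustrate why the manual approach gets harder beyond `avg`, here is a 
sketch of a mergeable variance state. It uses Welford's update and the pairwise 
merge formula from Chan et al. This is an assumption about what a stable 
implementation typically looks like, not a description of Arrow's kernel, and 
`VarState` and its functions are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical partial state for a variance aggregate; not an Arrow type.
struct VarState {
  int64_t count = 0;
  double mean = 0.0;
  double m2 = 0.0;  // sum of squared deviations from the running mean
};

// Fold one partition's values into its state (Welford's online update).
void Consume(VarState& s, const std::vector<double>& values) {
  for (double v : values) {
    ++s.count;
    const double delta = v - s.mean;
    s.mean += delta / static_cast<double>(s.count);
    s.m2 += delta * (v - s.mean);
  }
}

// Merge two partial states (pairwise combination, Chan et al.).
void Merge(VarState& into, const VarState& other) {
  if (other.count == 0) return;
  if (into.count == 0) {
    into = other;
    return;
  }
  const double na = static_cast<double>(into.count);
  const double nb = static_cast<double>(other.count);
  const double n = na + nb;
  const double delta = other.mean - into.mean;
  into.m2 += other.m2 + delta * delta * na * nb / n;
  into.mean += delta * nb / n;
  into.count += other.count;
}

// Population variance. Returns false when no values were consumed.
bool Finalize(const VarState& s, double* out) {
  if (s.count == 0) return false;
  *out = s.m2 / static_cast<double>(s.count);
  return true;
}
```

   Here the state is three fields with a non-obvious merge rule, which is the 
kind of per-kernel detail a caller would otherwise have to reimplement.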
   
   Since Acero's aggregate kernels already maintain this intermediate state 
internally, I wonder whether it would be possible to add APIs to the aggregate 
kernels for retrieving it, to enable these use cases. Thanks.
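
   Purely as an illustration of the kind of API shape we have in mind, the 
sketch below exports a partial state to bytes and imports it again, which is 
what the materialized-view case would need. `ExportState` and `ImportState` are 
hypothetical names, and the byte layout (native endianness, no versioning) is a 
simplifying assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical partial state for a mean aggregate; not an Arrow type.
struct MeanState {
  double sum = 0.0;
  int64_t count = 0;
};

constexpr std::size_t kMeanStateSize = sizeof(double) + sizeof(int64_t);

// Serialize the state field by field so it can be written to disk.
std::string ExportState(const MeanState& s) {
  std::string out(kMeanStateSize, '\0');
  std::memcpy(&out[0], &s.sum, sizeof(double));
  std::memcpy(&out[sizeof(double)], &s.count, sizeof(int64_t));
  return out;
}

// Restore a state from bytes. Returns false if the size does not match.
bool ImportState(const std::string& bytes, MeanState* s) {
  if (bytes.size() != kMeanStateSize) return false;
  std::memcpy(&s->sum, bytes.data(), sizeof(double));
  std::memcpy(&s->count, bytes.data() + sizeof(double), sizeof(int64_t));
  return true;
}
```

   A reader of the materialized view would import the stored state and run only 
the `finalize` step.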
   
   ### Component(s)
   
   C++


