GitHub user avamingli created a discussion: [Ideas] Switch to  Streaming 
Hashagg for multiple phase aggregation in postgres planner.

### Description

Currently, the Postgres planner selectively employs streaming aggregation in 
specific scenarios, such as DISTINCT operations or when parallelism is 
involved. 
However, this capability is not generically available for other multi-phase 
aggregation plans.

I've been looking at our TPC-DS results and the performance difference with 
Orca.
Recent benchmarking against the Orca optimizer using TPC-DS has revealed a 
consistent performance gap:  even when the final plans are very similar, Orca 
tends to be faster.
The root cause has been identified: Orca, by design, defaults to using 
Streaming Hash Aggregation for multi-phase aggregations, while the Postgres 
planner does not.

**Analysis of Streaming vs. Non-Streaming Aggregation**
The choice between streaming and non-streaming (hash) aggregation involves a 
trade-off:

* Benefits of Streaming: It can push aggregation computations higher up the 
plan tree, which often enhances parallelism and reduces the need for heavy data 
materialization.

* Drawbacks of Streaming: It may convert potential disk I/O pressure into 
network pressure, as grouped data is streamed between nodes.

It's hard for the planner to weigh this precisely because it depends on data 
distribution and duplication. But the empirical data from our benchmarks is 
clear: streaming is a net win overall. 

In the best-case scenario—where data per segment is largely unique—the 
non-streaming mode may force a full spill of all data to disk, while the 
streaming mode incurs almost no additional overhead.

So, I propose we switch the Postgres planner to use Streaming Hash Aggregation 
for these cases to match Orca's approach and get that performance boost.


### Use case/motivation

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

GitHub link: https://github.com/apache/cloudberry/discussions/1411

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to