GitHub user avamingli added a comment to the discussion: Proposal: Enable Parallel DQA Plans (with streaming hash agg)
Hi,

We recently ran into the same issue in a customer's environment, so I'd like to revisit this topic. The customer runs a large number of DISTINCT queries against very large tables, and those queries are slow. Their requirement is to improve query performance by parallelizing DISTINCT operations.

In PostgreSQL, DISTINCT operations cannot be parallelized, because deduplication must guarantee uniqueness across all records. In a single-server database like PostgreSQL this isn't feasible: the worker processes launched under a Gather node compete for rows in no particular order, so the same value can end up in several workers and no single worker can guarantee uniqueness on its own.

However, in a distributed environment, DISTINCT operations can indeed be parallelized. In fact, DISTINCT processing based on data distribution strategies already involves multiple processes (a Gang of workers) deduplicating simultaneously on different nodes. Inspired by this approach, we could implement parallel DISTINCT processing by redistributing data to the parallel worker processes based on appropriate conditions, so that each worker can then process its own portion.

GitHub link: https://github.com/apache/cloudberry/discussions/914#discussioncomment-13374878

----
This is an automatically sent email for dev@cloudberry.apache.org.
To unsubscribe, please send an email to: dev-unsubscr...@cloudberry.apache.org
For additional commands, e-mail: dev-h...@cloudberry.apache.org
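
To illustrate the redistribution idea in the comment above, here is a minimal Python sketch. It is not Cloudberry code: the function names (`redistribute`, `local_distinct`, `parallel_distinct`) are hypothetical, and it assumes the "appropriate condition" is a hash of the DISTINCT key. The workers are simulated sequentially. The point is only that once equal values are always routed to the same worker, each worker can deduplicate on its own and no final global deduplication step is needed.

```python
from collections import defaultdict


def redistribute(rows, n_workers):
    """Route every row to a worker chosen by hashing the DISTINCT key.

    Equal rows hash to the same value, so they always land on the same worker.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row) % n_workers].append(row)
    return [buckets[w] for w in range(n_workers)]


def local_distinct(rows):
    """Deduplicate the rows seen by a single worker, keeping first occurrences."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out


def parallel_distinct(rows, n_workers=4):
    """Simulate parallel DISTINCT: redistribute, dedup per worker, concatenate.

    The per-worker step runs sequentially here; in a real system each
    partition would be handled by a separate worker process.
    """
    partitions = redistribute(rows, n_workers)
    per_worker = [local_distinct(p) for p in partitions]
    # Partitions hold disjoint sets of values, so plain concatenation is
    # already globally distinct.
    return [row for part in per_worker for row in part]
```

By contrast, if rows were handed to workers arbitrarily (as when Gather workers compete for rows), the same value could appear in several workers' outputs and an extra deduplication pass over the combined result would still be required.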