[PR] Add distinct UNION set operator [pinot]

via GitHub Sun, 10 Aug 2025 23:53:41 -0700


yashmayya opened a new pull request, #16570:
URL: https://github.com/apache/pinot/pull/16570


   - Pinot currently implements only a `UNION ALL` operator; `UNION` semantics 
are achieved through the Calcite `UnionToDistinct` rule, which adds a distinct / 
empty grouping aggregate on top of `UNION ALL`.
   - This patch implements support for an actual distinct `UNION` set operator 
which offers numerous benefits over the above approach.
   - The primary benefit is that fewer rows are shuffled over the network. 
Furthermore, since we use hash-based partitioned distribution on the exchanges 
upstream of set operations, equal rows always reach the same union operator, so 
we don't need the "global" distinct after combining outputs from multiple 
parallel union operators (as long as they do the distinct operation themselves).
   - Note that while we don't officially support colocation for set operations, 
unlike joins (since `PinotRelDistributionTraitRule` doesn't have support for 
set operators in order to propagate distribution information through the 
relational plan tree), colocation can still be used in some limited scenarios. 
For example, consider this query in `ColocatedJoinQuickStart`:
   ```
   SELECT * FROM
   (SELECT userUUID FROM userAttributes /*+ 
tableOptions(partition_key='userUUID', partition_size='2') */)
   UNION
   (SELECT userUUID FROM userFactEvents /*+ 
tableOptions(partition_key='userUUID', partition_size='2') */);
   ```
   This can actually leverage colocation since the `tableOptions` hint applies 
to the table scans and associated mailboxes, and it results in 0 rows being 
shuffled over the network for the `UNION` operation. However, without support 
for an actual distinct `UNION` operator, this would be largely futile because 
the global distinct after the union would still result in a full data shuffle. 
With the new distinct `UNION` operator added here, the above query results in 
no network data shuffle (apart from the root server -> broker stage), which can 
be verified through the query stats.
   - Note that the `UnionToDistinct` Calcite rule hasn't been removed 
completely - it has been retained as an optional off-by-default rule that can 
be used via query option `SET usePlannerRules='UnionToDistinct';` in case any 
scenarios are encountered where it's more efficient to delegate to a global 
distinct.
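
The claim that a per-operator distinct is sufficient under hash partitioning can 
be illustrated with a small standalone sketch. This is not Pinot code: the 
function names (`hash_partition`, `partitioned_distinct_union`) and the sample 
rows are hypothetical, and Python's built-in `hash` stands in for whatever hash 
function the exchange actually uses. It only shows that when equal rows are 
always routed to the same partition, concatenating the locally de-duplicated 
outputs already yields a globally distinct result.

```python
from collections import defaultdict


def hash_partition(rows, num_partitions):
    """Route each row to a partition by hashing it (stand-in for a hash exchange)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row) % num_partitions].append(row)
    return partitions


def partitioned_distinct_union(left, right, num_partitions=2):
    """Each partition de-duplicates its own slice of both inputs; no global distinct."""
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    output = []
    for p in range(num_partitions):
        # Equal rows hash to the same partition, so a row seen here cannot
        # also appear in the output of any other partition.
        local_distinct = set(left_parts[p]) | set(right_parts[p])
        output.extend(local_distinct)
    return output


# Hypothetical sample data with duplicates within and across the two inputs.
left = ["u1", "u2", "u2", "u3"]
right = ["u3", "u4", "u4", "u1"]

result = partitioned_distinct_union(left, right)

# Rows a UNION ALL would hand to a downstream global distinct, versus rows
# emitted when each partition de-duplicates locally.
union_all_rows = len(left) + len(right)
distinct_union_rows = len(result)
```

In this sketch the `UNION ALL` path would forward all 8 input rows to a 
downstream distinct, while the partition-local version emits only the 4 unique 
values and needs no further de-duplication step.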


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

