huaxingao opened a new pull request, #6981:
URL: https://github.com/apache/iceberg/pull/6981
Push down min/max/count aggregates with group by when the group-by columns are
partition columns. For example:
```
CREATE TABLE test (id LONG, ts TIMESTAMP, data INT) USING iceberg
PARTITIONED BY (id, ts);
SELECT MIN(data), MAX(data), COUNT(data), COUNT(*) FROM test GROUP BY id, ts;
```
This PR makes the following changes:
1. Added the Expressions `GroupBy`, `UnboundGroupBy`, and `BoundGroupBy`
2. In `Aggregator`, take `GroupBy` into consideration. Using
`MaxAggregator` as an example:
- Add a `List<BoundGroupBy>` parameter to the constructor
- Add `private Map<StructLike, T> maxP`, which holds the max value for each
partition.
- The key of the above Map is `AggregateEvaluator.ArrayStructLike`, which
contains the partition values.
- Add `update(R value, StructLike partitionKey)` to update the value for a
specific partition
- Add `Map<StructLike, R> currentPartition()` to get the current value for
each partition
3. Build the Scan schema as: group-by columns + aggregates (this is the
layout Spark expects)
Using `SELECT MIN(data), MAX(data), COUNT(data), COUNT(*) FROM test GROUP BY
id, ts` as an example,
the schema is
`LONG, TIMESTAMP, INT, INT, LONG, LONG`,
which corresponds to
`id, ts, MIN(data), MAX(data), COUNT(data), COUNT(*)`
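The partition-keyed aggregator described in the change list can be sketched as follows. This is a simplified, hypothetical illustration, not the actual Iceberg `MaxAggregator`: the real implementation keys the map by `AggregateEvaluator.ArrayStructLike`, while here a `List<Object>` of partition values stands in for `StructLike`, and the value type is fixed to `Integer` instead of the generic `T`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a partition-aware max aggregator.
public class PartitionedMaxSketch {
  // Max value per partition, analogous to `private Map<StructLike, T> maxP`;
  // the key stands in for AggregateEvaluator.ArrayStructLike.
  private final Map<List<Object>, Integer> maxP = new HashMap<>();

  // Analogous to `update(R value, StructLike partitionKey)`: keep the
  // larger of the stored value and the incoming value for this partition.
  public void update(Integer value, List<Object> partitionKey) {
    maxP.merge(partitionKey, value, Math::max);
  }

  // Analogous to `Map<StructLike, R> currentPartition()`: the current
  // max per partition key.
  public Map<List<Object>, Integer> currentPartition() {
    return maxP;
  }
}
```

With two partitions `(id=1, ts=...)` and `(id=2, ts=...)`, each `update` call folds a file-level max into the entry for its partition, so the final map has one max per group-by key.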
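The scan-schema layout above (group-by columns first, then aggregates) can be sketched as row assembly. The class and method names here are illustrative, not actual Iceberg or Spark interfaces; each per-partition aggregate result is prefixed with its partition values so the row matches `id, ts, MIN(data), MAX(data), COUNT(data), COUNT(*)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of assembling pushed-down result rows in the layout
// Spark expects: group-by columns first, then the aggregate values.
public class GroupByRowLayout {
  // One row per partition: [partition values..., aggregate values...].
  public static List<List<Object>> toRows(Map<List<Object>, List<Object>> aggsByPartition) {
    List<List<Object>> rows = new ArrayList<>();
    for (Map.Entry<List<Object>, List<Object>> e : aggsByPartition.entrySet()) {
      List<Object> row = new ArrayList<>(e.getKey()); // group-by columns: id, ts
      row.addAll(e.getValue());                       // MIN, MAX, COUNT(data), COUNT(*)
      rows.add(row);
    }
    return rows;
  }
}
```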
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]