huaxingao opened a new pull request, #6981: URL: https://github.com/apache/iceberg/pull/6981
Push down min/max/count with group by if the group by is on partition columns. For example:

```sql
CREATE TABLE test (id LONG, ts TIMESTAMP, data INT) USING iceberg PARTITIONED BY (id, ts);
SELECT MIN(data), MAX(data), COUNT(data), COUNT(*) FROM test GROUP BY id, ts;
```

This PR makes the following changes:
1. Added the Expressions `GroupBy`, `UnboundGroupBy`, and `BoundGroupBy`.
2. Made the aggregators take `GroupBy` into consideration. Using `MaxAggregator` as an example:
   - Added a `List<BoundGroupBy>` parameter to the constructor.
   - Added `private Map<StructLike, T> maxP`, which holds the max value for each partition.
   - The key of this map is `AggregateEvaluator.ArrayStructLike`, which contains the partition values.
   - Added `update(R value, StructLike partitionKey)` to update the value for a specific partition.
   - Added `Map<StructLike, R> currentPartition()` to get the current value for each partition.
3. Built the scan schema as: group-by columns + aggregates (this is what Spark expects). Using `SELECT MIN(data), MAX(data), COUNT(data), COUNT(*) FROM test GROUP BY id, ts` as an example, the schema is `LONG, TIMESTAMP, INT, INT, LONG, LONG`, which corresponds to `id, ts, MIN(data), MAX(data), COUNT(data), COUNT(*)`.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
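The per-partition aggregation described in step 2 can be sketched roughly as below. This is an illustrative simplification, not the PR's actual code: the real key type is Iceberg's `StructLike` (via `AggregateEvaluator.ArrayStructLike`), for which `List<Object>` stands in here so the example is self-contained, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, group-by-aware max aggregator: instead of a single running max,
// it keeps one running max per partition key.
class GroupedMaxAggregator<T extends Comparable<T>> {

  // corresponds to the "maxP" map from the PR description:
  // partition key -> max value seen so far in that partition
  private final Map<List<Object>, T> maxPerPartition = new HashMap<>();

  // mirrors update(R value, StructLike partitionKey): fold a row's value
  // into the running max for the row's partition
  void update(T value, List<Object> partitionKey) {
    maxPerPartition.merge(
        partitionKey, value, (current, v) -> current.compareTo(v) >= 0 ? current : v);
  }

  // mirrors currentPartition(): the per-partition results so far
  Map<List<Object>, T> currentResults() {
    return maxPerPartition;
  }

  public static void main(String[] args) {
    GroupedMaxAggregator<Integer> agg = new GroupedMaxAggregator<>();
    agg.update(5, List.of(1L)); // partition id = 1
    agg.update(9, List.of(1L));
    agg.update(3, List.of(2L)); // partition id = 2
    System.out.println(agg.currentResults()); // one max per partition; map order may vary
  }
}
```

At planning time, each result row would then be assembled as the partition values followed by that partition's aggregate values, matching the scan schema in step 3.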