huaxingao opened a new pull request, #6981:
URL: https://github.com/apache/iceberg/pull/6981

   Push down MIN/MAX/COUNT with GROUP BY if the GROUP BY is on partition columns.
   
   For example:
   ```
   CREATE TABLE test (id LONG, ts TIMESTAMP, data INT) USING iceberg PARTITIONED BY (id, ts);
   SELECT MIN(data), MAX(data), COUNT(data), COUNT(*) FROM test GROUP BY id, ts;
   ```
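
   Because the GROUP BY columns are exactly the partition columns, every data file belongs to a single group, so these aggregates can be folded from per-file column statistics without scanning any data rows. A minimal sketch of that idea (all names here are illustrative stand-ins, not Iceberg APIs):

   ```java
   import java.util.*;

   public class PartitionAggSketch {
     // Simplified stand-in for a manifest entry: the file's partition tuple
     // plus its column-level stats for the `data` column.
     record FileStats(List<Object> partition, int minData, int maxData,
                      long dataValueCount, long recordCount) {}

     // One result per group: MIN(data), MAX(data), COUNT(data), COUNT(*).
     record GroupResult(int min, int max, long count, long countStar) {}

     static Map<List<Object>, GroupResult> aggregate(List<FileStats> files) {
       Map<List<Object>, GroupResult> results = new HashMap<>();
       for (FileStats f : files) {
         // Fold each file's stats into its partition's running result.
         results.merge(
             f.partition(),
             new GroupResult(f.minData(), f.maxData(), f.dataValueCount(), f.recordCount()),
             (a, b) -> new GroupResult(
                 Math.min(a.min(), b.min()),
                 Math.max(a.max(), b.max()),
                 a.count() + b.count(),
                 a.countStar() + b.countStar()));
       }
       return results;
     }
   }
   ```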
   
   
   This PR contains the following changes:
   
   1. Added the Expressions `GroupBy`, `UnboundGroupBy`, and `BoundGroupBy`.
   2. In `Aggregator`, take `GroupBy` into consideration. Using `MaxAggregator` as an example:
   
      - Add `List<BoundGroupBy>` to the constructor.
      - Add `private Map<StructLike, T> maxP`, which holds the max value for each partition.
      - The key of the above Map is `AggregateEvaluator.ArrayStructLike`, which contains the partition values.
      - Add `update(R value, StructLike partitionKey)` to update the value for a specific partition.
      - Add `Map<StructLike, R> currentPartition()` to get the current value for each partition.
   
   3. Build the Scan schema to be: group-by columns + aggregates (this is what Spark expects).
   Using `SELECT MIN(data), MAX(data), COUNT(data), COUNT(*) FROM test GROUP BY id, ts` as an example:
   the schema is 
   `LONG, TIMESTAMP, INT, INT, LONG, LONG`, 
   which corresponds to
   `id, ts, MIN(data), MAX(data), COUNT(data), COUNT(*)`
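
   The per-partition aggregator shape described above can be sketched roughly as follows; the class and method names mirror the description but are simplified, illustrative stand-ins (e.g. a `List<Object>` in place of `StructLike`), not the actual Iceberg types:

   ```java
   import java.util.*;

   public class GroupedMaxSketch {
     // Per-partition max values, keyed by the partition tuple
     // (standing in for AggregateEvaluator.ArrayStructLike).
     private final Map<List<Object>, Integer> maxPerPartition = new HashMap<>();

     // Update the running max for one partition.
     public void update(int value, List<Object> partitionKey) {
       maxPerPartition.merge(partitionKey, value, Math::max);
     }

     // Emit one row per partition in the layout Spark expects:
     // group-by columns first, then the aggregate value.
     public List<Object[]> resultRows() {
       List<Object[]> rows = new ArrayList<>();
       for (Map.Entry<List<Object>, Integer> e : maxPerPartition.entrySet()) {
         Object[] row = new Object[e.getKey().size() + 1];
         for (int i = 0; i < e.getKey().size(); i++) {
           row[i] = e.getKey().get(i);
         }
         row[e.getKey().size()] = e.getValue();
         rows.add(row);
       }
       return rows;
     }
   }
   ```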
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

