count push down for partition columns [iceberg]

via GitHub Wed, 10 Jan 2024 09:34:07 -0800


xiaoxuandev opened a new pull request, #9457:
URL: https://github.com/apache/iceberg/pull/9457


   ### Notes
   Support min/max/count aggregate push down for partition columns
   
   - min/max/count aggregate push down does not work when partition columns 
are not present as data columns (their stats won't be present in the Avro 
files). Even when the aggregate has been pushed down to the data source, 
`AggregateEvaluator` fails and the query still goes through a full table scan.
   - Add support by updating the evaluator based on `PartitionData`.
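
The idea in the notes above can be sketched outside of Iceberg as follows. This is only an illustration under stated assumptions, not the code in this PR: `FileEntry` and `aggregate_partition_column` are hypothetical names, the partition transform is assumed to be identity (so every row in a data file shares the file's partition value), and delete files are ignored.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for the per-file metadata a planner can read."""
    partition_value: Optional[int]  # identity-partition value shared by all rows in the file
    record_count: int               # number of rows in the file


def aggregate_partition_column(
    files: Iterable[FileEntry],
) -> Tuple[int, Optional[int], Optional[int]]:
    """Compute count(col), min(col), max(col) for a partition column
    from file-level metadata only, without reading any data rows."""
    count = 0
    minimum: Optional[int] = None
    maximum: Optional[int] = None
    for f in files:
        # Null partition values do not contribute to count(col), min or max;
        # empty files contribute nothing either.
        if f.partition_value is None or f.record_count == 0:
            continue
        count += f.record_count
        if minimum is None or f.partition_value < minimum:
            minimum = f.partition_value
        if maximum is None or f.partition_value > maximum:
            maximum = f.partition_value
    return count, minimum, maximum


files = [
    FileEntry(partition_value=2451000, record_count=10),
    FileEntry(partition_value=2450900, record_count=5),
    FileEntry(partition_value=None, record_count=3),
]
print(aggregate_partition_column(files))
```

The point of the sketch is that, for an identity-partitioned column, the partition value plus the record count of each file carry enough information to answer these aggregates even when no column-level stats exist for that column.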
   
   ### Testing
   Created a Hive table: 
   `CREATE EXTERNAL TABLE store_sales (id INT, data INT) PARTITIONED BY (ss_sold_date_sk INT)`
   then registered it as an Iceberg table.
   
   Tested on Spark 3.5: verified that count/min/max are successfully pushed 
down, and that simple queries (`select count(ss_sold_date_sk) from store_sales`, 
`select min(ss_sold_date_sk) from store_sales`, and 
`select max(ss_sold_date_sk) from store_sales`) are faster with the change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] Spark: Support min/max/count push down for partition columns [iceberg]
