rdblue opened a new pull request, #6590: URL: https://github.com/apache/iceberg/pull/6590
This adds an experimental `sql` command to the CLI. The table output is not quite finished, but it can use duckdb to run queries, like this:

```
[blue@work python]$ time poetry run pyiceberg --verbose true sql --table taxi.nyc_taxi_yellow="pickup_time >= '2021-12-01T00:00:00+00:00'" 'select count(1) as ride_count from nyc_taxi_yellow'
(3214368,)

real    2m4.358s
user    0m5.163s
sys     0m4.465s
```

The CLI requires `--table` options to identify tables and to set a filter for each load. Each table is planned using the given filter and registered with duckdb under its table name. In the example above, `taxi.nyc_taxi_yellow` is loaded as `nyc_taxi_yellow` and filtered using the expression after `=`.

This also parallelizes scan planning, which now takes about a second in total. The remainder of the time is spent reading the Parquet data files; that read path apparently needs to be optimized, since the time goes to loading the dataset rather than to the pyiceberg projection.
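For clarity, a minimal sketch of what one `--table` option amounts to, assuming the pyiceberg catalog/scan APIs and duckdb's `register()`; the catalog name, the `to_arrow()` call, and the hand-parsed option value are illustrative and may not match the actual CLI code:

```python
# Sketch only: load a filtered Iceberg table and expose it to duckdb by name.
# Assumes a configured catalog called "default"; APIs may differ by pyiceberg version.
import duckdb
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")

# equivalent of: --table taxi.nyc_taxi_yellow="pickup_time >= '2021-12-01T00:00:00+00:00'"
identifier = "taxi.nyc_taxi_yellow"
row_filter = "pickup_time >= '2021-12-01T00:00:00+00:00'"

# plan the scan with the given filter and materialize it as an Arrow table
table = catalog.load_table(identifier)
arrow_table = table.scan(row_filter=row_filter).to_arrow()

# register under the short name so the query can refer to nyc_taxi_yellow directly
con = duckdb.connect()
con.register(identifier.split(".")[-1], arrow_table)

print(con.execute("select count(1) as ride_count from nyc_taxi_yellow").fetchall())
```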
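And a rough sketch of the kind of parallel scan planning described above, fanning manifest reads out over a thread pool; `read_manifest` here is a hypothetical stand-in for pyiceberg's manifest parsing, not its real API:

```python
# Sketch only: read manifests concurrently and merge their data file entries.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def plan_files(manifest_paths: List[str], read_manifest: Callable[[str], List[str]]) -> List[str]:
    """Read all manifests in parallel and return the combined list of data files."""
    with ThreadPoolExecutor() as pool:
        # map() runs one read_manifest call per manifest concurrently, preserving order
        results = pool.map(read_manifest, manifest_paths)
    return [data_file for files in results for data_file in files]
```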