rdblue opened a new pull request, #6590:
URL: https://github.com/apache/iceberg/pull/6590

   This adds an experimental `sql` command to the CLI. Table-formatted output 
is not finished yet, but the command can already run queries through duckdb, 
like this:
   
   ```
   [blue@work python]$ time poetry run pyiceberg --verbose true sql --table taxi.nyc_taxi_yellow="pickup_time >= '2021-12-01T00:00:00+00:00'" 'select count(1) as ride_count from nyc_taxi_yellow'
   (3214368,)
   
   real    2m4.358s
   user    0m5.163s
   sys     0m4.465s
   ```
   
   The command requires `--table` options, each of which identifies a table 
and sets the filter used to load it. Scans of those tables are planned with 
the given filter, and the results are added to duckdb under the table name. In 
the example above, `taxi.nyc_taxi_yellow` is loaded as `nyc_taxi_yellow` and 
is filtered using the filter string after the first `=`.
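
To make the `--table` format concrete, here is a minimal sketch of how such a value could be split into an identifier, a duckdb table name, and a filter string. It is not the code from this PR: `TableSpec` and `parse_table_option` are hypothetical names, and treating a missing filter as an empty string is an assumption.

```python
from typing import NamedTuple


class TableSpec(NamedTuple):
    identifier: str  # full Iceberg identifier, e.g. "taxi.nyc_taxi_yellow"
    alias: str       # name the table would be registered under in duckdb
    row_filter: str  # filter expression used when planning the scan


def parse_table_option(value: str) -> TableSpec:
    """Split a hypothetical `IDENTIFIER=FILTER` option value into its parts."""
    # Split on the first "=" only: the filter may itself contain "=" (as in ">=").
    identifier, _, row_filter = value.partition("=")
    identifier = identifier.strip()
    if not identifier:
        raise ValueError(f"missing table identifier in {value!r}")
    # Assumption: the alias is the last dot-separated part of the identifier.
    alias = identifier.rsplit(".", 1)[-1]
    return TableSpec(identifier, alias, row_filter.strip())


spec = parse_table_option(
    "taxi.nyc_taxi_yellow=pickup_time >= '2021-12-01T00:00:00+00:00'"
)
print(spec.identifier, spec.alias, spec.row_filter)
```

Splitting on the first `=` rather than every `=` is what keeps a filter such as `pickup_time >= ...` intact.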
   
   This also parallelizes scan planning, which now takes about a second in 
total. The remainder of the time is spent reading the Parquet data files. That 
read apparently needs to be optimized: the slow part is loading the dataset, 
not the pyiceberg projection.
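
For readers unfamiliar with the idea, one common way to parallelize this kind of I/O-bound planning is a thread pool that reads manifests concurrently. The sketch below shows only that general pattern under stated assumptions. It is not the implementation in this PR, and `plan_in_parallel`, `read_manifest`, and the sample data are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def plan_in_parallel(
    manifests: Iterable[str],
    read_manifest: Callable[[str], List[str]],
    max_workers: int = 8,
) -> List[str]:
    """Read each manifest on a worker thread and flatten the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so the output is deterministic.
        per_manifest = pool.map(read_manifest, manifests)
        return [task for tasks in per_manifest for task in tasks]


# Hypothetical stand-in for reading a manifest and returning its data files.
FAKE_MANIFESTS = {
    "m1.avro": ["a.parquet", "b.parquet"],
    "m2.avro": ["c.parquet"],
}

tasks = plan_in_parallel(FAKE_MANIFESTS, lambda name: FAKE_MANIFESTS[name])
print(tasks)
```

Threads suit this pattern because the work is dominated by waiting on storage reads rather than by CPU time.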


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

