bitsondatadev commented on code in PR #9836: URL: https://github.com/apache/iceberg/pull/9836#discussion_r1508089551
########## docs/docs/daft.md: ########## @@ -0,0 +1,148 @@ +--- +title: "Daft" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Daft + +[Daft](www.getdaft.io) is a Python/Rust-based distributed query engine with a Python DataFrame API. Review Comment: Rewording for some context and links to the API. ```suggestion [Daft](https://www.getdaft.io) is a distributed query engine written in Python and Rust, two fast-growing ecosystems in the data engineering and machine learning industry. It exposes its own flavor of the widely adopted [DataFrame API](https://www.getdaft.io/projects/docs/en/latest/api_docs/dataframe.html), akin to many existing Python libraries. ``` ########## docs/docs/daft.md: ########## @@ -0,0 +1,148 @@ +--- +title: "Daft" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. 
You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Daft + +[Daft](www.getdaft.io) is a Python/Rust-based distributed query engine with a Python DataFrame API. + +Iceberg supports reading of Iceberg tables into Daft DataFrames by using the Python client library [PyIceberg](https://py.iceberg.apache.org/). + +For Python users, Daft is complementary to PyIceberg as a query engine layer: + +* **PyIceberg:** catalog/table management tasks (e.g. creation of tables, modifying table schemas) +* **Daft:** querying tables (e.g. previewing tables, data ETL and analysis) Review Comment: This section didn't say much of anything new for Iceberg users, and really missed the value that the Daft integration offers them. Feel free to counter my suggestions with another one. Also, try to dig more into Daft features that connect Iceberg from Data Engineers to Data Scientists/ML engineers. ```suggestion [PyIceberg](https://py.iceberg.apache.org/) supports reading of Iceberg tables into Daft DataFrames, which simplifies running transformation and machine learning workloads in the Python ecosystem. This offers a novel experience for data consumers to migrate their models in-place using Iceberg's catalog and table management, while utilizing Daft's compute engine capabilities for use cases from traditional analysis, to advanced feature training. ``` ########## docs/docs/daft.md: ########## @@ -0,0 +1,148 @@ +--- +title: "Daft" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. 
See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Daft + +[Daft](www.getdaft.io) is a Python/Rust-based distributed query engine with a Python DataFrame API. + +Iceberg supports reading of Iceberg tables into Daft DataFrames by using the Python client library [PyIceberg](https://py.iceberg.apache.org/). + +For Python users, Daft is complementary to PyIceberg as a query engine layer: + +* **PyIceberg:** catalog/table management tasks (e.g. creation of tables, modifying table schemas) +* **Daft:** querying tables (e.g. previewing tables, data ETL and analysis) + +In database terms, PyIceberg is the Data Description Language (DDL) for database administration and Daft is the Data Manipulation Language (DML) for querying data. + +## Enabling Iceberg support in Daft + +To use Iceberg with Daft, simply ensure that the [PyIceberg](https://py.iceberg.apache.org/) library is also installed in your current Python environment. + +``` +pip install getdaft pyiceberg +``` + +## Querying Iceberg using Daft + +### Reading PyIceberg tables + +Daft interacts natively with [PyIceberg](https://py.iceberg.apache.org/) to read from Iceberg. + +Simply load a PyIceberg table and pass it into Daft as follows: Review Comment: Avoid empty lines between headings, at least add a single sentence to visually break things up. 
Avoid saying "reading from Iceberg", as Iceberg is mainly a table spec and some opinionated libraries, not a running system. Remove "Simply". ```suggestion ## Querying Iceberg using Daft Daft interacts natively with [PyIceberg](https://py.iceberg.apache.org/) to read Iceberg tables. ### Reading Iceberg tables Create an Iceberg table following [the spark-quickstart tutorial](https://iceberg.apache.org/spark-quickstart/). Load the Iceberg table `demo.nyc.taxis` into Daft, limiting it to the first three columns. ``` ########## docs/docs/daft.md: ########## @@ -0,0 +1,148 @@ +--- +title: "Daft" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Daft + +[Daft](www.getdaft.io) is a Python/Rust-based distributed query engine with a Python DataFrame API. + +Iceberg supports reading of Iceberg tables into Daft DataFrames by using the Python client library [PyIceberg](https://py.iceberg.apache.org/). + +For Python users, Daft is complementary to PyIceberg as a query engine layer: + +* **PyIceberg:** catalog/table management tasks (e.g. creation of tables, modifying table schemas) +* **Daft:** querying tables (e.g. 
previewing tables, data ETL and analysis) + +In database terms, PyIceberg is the Data Description Language (DDL) for database administration and Daft is the Data Manipulation Language (DML) for querying data. + +## Enabling Iceberg support in Daft + +To use Iceberg with Daft, simply ensure that the [PyIceberg](https://py.iceberg.apache.org/) library is also installed in your current Python environment. + +``` +pip install getdaft pyiceberg +``` + +## Querying Iceberg using Daft + +### Reading PyIceberg tables + +Daft interacts natively with [PyIceberg](https://py.iceberg.apache.org/) to read from Iceberg. + +Simply load a PyIceberg table and pass it into Daft as follows: + +``` py +import daft +from pyiceberg import load_catalog + +table = load_catalog("my_catalog").load_table("my_tpch_namespace.lineitem") +df = daft.read_iceberg(table) +df = df.select("L_SHIPDATE", "L_ORDERKEY", "L_COMMENT") +df.show() Review Comment: We should use the taxi cab data as it's consistent with much of the other documentation. https://iceberg.apache.org/spark-quickstart/?h=catalog#creating-a-table ```suggestion table = load_catalog("demo").load_table("nyc.taxis") df = daft.read_iceberg(table) df = df.select("vendor_id", "trip_id", "trip_distance") df.show() ``` ########## docs/docs/daft.md: ########## @@ -0,0 +1,148 @@ +--- +title: "Daft" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. 
You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Daft + +[Daft](www.getdaft.io) is a Python/Rust-based distributed query engine with a Python DataFrame API. + +Iceberg supports reading of Iceberg tables into Daft DataFrames by using the Python client library [PyIceberg](https://py.iceberg.apache.org/). + +For Python users, Daft is complementary to PyIceberg as a query engine layer: + +* **PyIceberg:** catalog/table management tasks (e.g. creation of tables, modifying table schemas) +* **Daft:** querying tables (e.g. previewing tables, data ETL and analysis) + +In database terms, PyIceberg is the Data Description Language (DDL) for database administration and Daft is the Data Manipulation Language (DML) for querying data. + +## Enabling Iceberg support in Daft + +To use Iceberg with Daft, simply ensure that the [PyIceberg](https://py.iceberg.apache.org/) library is also installed in your current Python environment. Review Comment: We'll eventually be adding the Microsoft style guide, but try to avoid the terms "simple"/"simply" in general unless the context calls for them. They can come off as condescending, especially to a beginner. https://learn.microsoft.com/en-us/style-guide/a-z-word-list-term-collections/s/simply ```suggestion To use Iceberg with Daft, ensure that the [PyIceberg](https://py.iceberg.apache.org/) library is also installed in your current Python environment. ``` ########## docs/docs/daft.md: ########## @@ -0,0 +1,148 @@ +--- +title: "Daft" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. 
See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Daft + +[Daft](www.getdaft.io) is a Python/Rust-based distributed query engine with a Python DataFrame API. + +Iceberg supports reading of Iceberg tables into Daft DataFrames by using the Python client library [PyIceberg](https://py.iceberg.apache.org/). + +For Python users, Daft is complementary to PyIceberg as a query engine layer: + +* **PyIceberg:** catalog/table management tasks (e.g. creation of tables, modifying table schemas) +* **Daft:** querying tables (e.g. previewing tables, data ETL and analysis) + +In database terms, PyIceberg is the Data Description Language (DDL) for database administration and Daft is the Data Manipulation Language (DML) for querying data. + +## Enabling Iceberg support in Daft + +To use Iceberg with Daft, simply ensure that the [PyIceberg](https://py.iceberg.apache.org/) library is also installed in your current Python environment. + +``` +pip install getdaft pyiceberg +``` + +## Querying Iceberg using Daft + +### Reading PyIceberg tables + +Daft interacts natively with [PyIceberg](https://py.iceberg.apache.org/) to read from Iceberg. 
+ +Simply load a PyIceberg table and pass it into Daft as follows: + +``` py +import daft +from pyiceberg import load_catalog + +table = load_catalog("my_catalog").load_table("my_tpch_namespace.lineitem") +df = daft.read_iceberg(table) +df = df.select("L_SHIPDATE", "L_ORDERKEY", "L_COMMENT") +df.show() +``` + +``` +╭────────────┬────────────┬────────────────────────────────╮ +│ L_SHIPDATE ┆ L_ORDERKEY ┆ L_COMMENT │ +│ --- ┆ --- ┆ --- │ +│ Date ┆ Int64 ┆ Utf8 │ +╞════════════╪════════════╪════════════════════════════════╡ +│ 1992-01-02 ┆ 2186280097 ┆ ions sleep about the si │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 175366628 ┆ gular accoun │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186602151 ┆ blithely even │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3937663654 ┆ ake boldly among the ideas. s… │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186781220 ┆ thely. slyly pending ideas ar… │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3937999493 ┆ haggle at the regular, pen │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186933061 ┆ ickly. slyly │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3938167204 ┆ carefully silent instructions… │ +╰────────────┴────────────┴────────────────────────────────╯ + +(Showing first 8 rows) +``` + +Any subsequent filter operations on the Daft `df` DataFrame object will be correctly optimized to take advantage of Iceberg features such as hidden partitioning and file-level statistics for efficient reads. 
Review Comment: ```suggestion Any filter operations on the Daft dataframe, `df`, will [push down the filters](https://iceberg.apache.org/docs/latest/performance/#data-filtering), effectuate [hidden partitioning](https://iceberg.apache.org/docs/latest/partitioning/), and utilize [table statistics to inform query planning](https://iceberg.apache.org/docs/latest/performance/#scan-planning) for efficient reads. ``` ########## docs/docs/daft.md: ########## @@ -0,0 +1,148 @@ +--- +title: "Daft" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Daft + +[Daft](www.getdaft.io) is a Python/Rust-based distributed query engine with a Python DataFrame API. + +Iceberg supports reading of Iceberg tables into Daft DataFrames by using the Python client library [PyIceberg](https://py.iceberg.apache.org/). + +For Python users, Daft is complementary to PyIceberg as a query engine layer: + +* **PyIceberg:** catalog/table management tasks (e.g. creation of tables, modifying table schemas) +* **Daft:** querying tables (e.g. 
previewing tables, data ETL and analysis) + +In database terms, PyIceberg is the Data Description Language (DDL) for database administration and Daft is the Data Manipulation Language (DML) for querying data. Review Comment: Hard disagree with this statement. Iceberg is a spec that supports SQL in all categories, PyIceberg has a minimal implementation that supports DDL/DML, and in theory engines will use the API to standardize rather than reinvent the wheel for every category of SQL statement. What were you actually trying to communicate here? Let's think about how to word it. I'll make an educated guess; please feel free to rephrase. ```suggestion Daft's strength is that it provides a DataFrame representation and transformation suite while integrating [Iceberg tables](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_functions/daft.read_iceberg.html) with other specialized systems such as [Ray](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_functions/daft.from_ray_dataset.html), [Dask](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_functions/daft.from_dask_dataframe.html), [Arrow](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_functions/daft.from_arrow.html), [Parquet](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_functions/daft.read_parquet.html), and a few other formats. Combined with Iceberg's ability to manage table-level concerns like schema and partition evolution, this makes an ultimate pairing of storage and compute technologies to facilitate heavy processing and machine learning workloads. ``` ########## docs/docs/daft.md: ########## @@ -0,0 +1,148 @@ +--- +title: "Daft" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. 
+ - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Daft + +[Daft](www.getdaft.io) is a Python/Rust-based distributed query engine with a Python DataFrame API. + +Iceberg supports reading of Iceberg tables into Daft DataFrames by using the Python client library [PyIceberg](https://py.iceberg.apache.org/). + +For Python users, Daft is complementary to PyIceberg as a query engine layer: + +* **PyIceberg:** catalog/table management tasks (e.g. creation of tables, modifying table schemas) +* **Daft:** querying tables (e.g. previewing tables, data ETL and analysis) + +In database terms, PyIceberg is the Data Description Language (DDL) for database administration and Daft is the Data Manipulation Language (DML) for querying data. + +## Enabling Iceberg support in Daft + +To use Iceberg with Daft, simply ensure that the [PyIceberg](https://py.iceberg.apache.org/) library is also installed in your current Python environment. + +``` +pip install getdaft pyiceberg +``` + +## Querying Iceberg using Daft + +### Reading PyIceberg tables + +Daft interacts natively with [PyIceberg](https://py.iceberg.apache.org/) to read from Iceberg. 
+ +Simply load a PyIceberg table and pass it into Daft as follows: + +``` py +import daft +from pyiceberg import load_catalog + +table = load_catalog("my_catalog").load_table("my_tpch_namespace.lineitem") +df = daft.read_iceberg(table) +df = df.select("L_SHIPDATE", "L_ORDERKEY", "L_COMMENT") +df.show() +``` + +``` +╭────────────┬────────────┬────────────────────────────────╮ +│ L_SHIPDATE ┆ L_ORDERKEY ┆ L_COMMENT │ +│ --- ┆ --- ┆ --- │ +│ Date ┆ Int64 ┆ Utf8 │ +╞════════════╪════════════╪════════════════════════════════╡ +│ 1992-01-02 ┆ 2186280097 ┆ ions sleep about the si │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 175366628 ┆ gular accoun │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186602151 ┆ blithely even │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3937663654 ┆ ake boldly among the ideas. s… │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186781220 ┆ thely. slyly pending ideas ar… │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3937999493 ┆ haggle at the regular, pen │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186933061 ┆ ickly. slyly │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3938167204 ┆ carefully silent instructions… │ +╰────────────┴────────────┴────────────────────────────────╯ + +(Showing first 8 rows) +``` Review Comment: Regenerate this using the Spark demo: https://iceberg.apache.org/spark-quickstart/ ########## docs/docs/daft.md: ########## @@ -0,0 +1,148 @@ +--- +title: "Daft" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. 
+ - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Daft + +[Daft](www.getdaft.io) is a Python/Rust-based distributed query engine with a Python DataFrame API. + +Iceberg supports reading of Iceberg tables into Daft DataFrames by using the Python client library [PyIceberg](https://py.iceberg.apache.org/). + +For Python users, Daft is complementary to PyIceberg as a query engine layer: + +* **PyIceberg:** catalog/table management tasks (e.g. creation of tables, modifying table schemas) +* **Daft:** querying tables (e.g. previewing tables, data ETL and analysis) + +In database terms, PyIceberg is the Data Description Language (DDL) for database administration and Daft is the Data Manipulation Language (DML) for querying data. + +## Enabling Iceberg support in Daft + +To use Iceberg with Daft, simply ensure that the [PyIceberg](https://py.iceberg.apache.org/) library is also installed in your current Python environment. + +``` +pip install getdaft pyiceberg +``` + +## Querying Iceberg using Daft + +### Reading PyIceberg tables + +Daft interacts natively with [PyIceberg](https://py.iceberg.apache.org/) to read from Iceberg. 
+ +Simply load a PyIceberg table and pass it into Daft as follows: + +``` py +import daft +from pyiceberg import load_catalog + +table = load_catalog("my_catalog").load_table("my_tpch_namespace.lineitem") +df = daft.read_iceberg(table) +df = df.select("L_SHIPDATE", "L_ORDERKEY", "L_COMMENT") +df.show() +``` + +``` +╭────────────┬────────────┬────────────────────────────────╮ +│ L_SHIPDATE ┆ L_ORDERKEY ┆ L_COMMENT │ +│ --- ┆ --- ┆ --- │ +│ Date ┆ Int64 ┆ Utf8 │ +╞════════════╪════════════╪════════════════════════════════╡ +│ 1992-01-02 ┆ 2186280097 ┆ ions sleep about the si │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 175366628 ┆ gular accoun │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186602151 ┆ blithely even │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3937663654 ┆ ake boldly among the ideas. s… │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186781220 ┆ thely. slyly pending ideas ar… │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3937999493 ┆ haggle at the regular, pen │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186933061 ┆ ickly. slyly │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3938167204 ┆ carefully silent instructions… │ +╰────────────┴────────────┴────────────────────────────────╯ + +(Showing first 8 rows) +``` + +Any subsequent filter operations on the Daft `df` DataFrame object will be correctly optimized to take advantage of Iceberg features such as hidden partitioning and file-level statistics for efficient reads. 
+ +``` py +import datetime + +# Filter which takes advantage of partition pruning capabilities of Iceberg +df = df.where(df["L_SHIPDATE"] > datetime.date(1993, 1, 1)) +df.show() +``` + +``` +╭────────────┬────────────┬────────────────────────────────╮ +│ L_SHIPDATE ┆ L_ORDERKEY ┆ L_COMMENT │ +│ --- ┆ --- ┆ --- │ +│ Date ┆ Int64 ┆ Utf8 │ +╞════════════╪════════════╪════════════════════════════════╡ +│ 1993-01-02 ┆ 5695313125 ┆ slyly special p │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1993-01-02 ┆ 2701326853 ┆ ironic instru │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1993-01-02 ┆ 5695313766 ┆ ly according │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1993-01-02 ┆ 2701330720 ┆ y alongside of the blithely │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1993-01-02 ┆ 5695315200 ┆ ckly final foxes haggle car │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1993-01-02 ┆ 2701331524 ┆ ns doze slyly pending instruc… │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1993-01-02 ┆ 5695317377 ┆ re about the ironic, silen │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1993-01-02 ┆ 2701342819 ┆ fully even pinto beans wa │ +╰────────────┴────────────┴────────────────────────────────╯ + +(Showing first 8 rows) +``` + +### Type compatibility + +Daft and Iceberg have compatible type systems. Here are how types are converted across the two systems. + +When reading from an Iceberg source into Daft: Review Comment: Would this be different in the other direction? Maybe just delete this line? ########## docs/docs/daft.md: ########## @@ -0,0 +1,148 @@ +--- +title: "Daft" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. 
+ - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Daft + +[Daft](www.getdaft.io) is a Python/Rust-based distributed query engine with a Python DataFrame API. + +Iceberg supports reading of Iceberg tables into Daft DataFrames by using the Python client library [PyIceberg](https://py.iceberg.apache.org/). + +For Python users, Daft is complementary to PyIceberg as a query engine layer: + +* **PyIceberg:** catalog/table management tasks (e.g. creation of tables, modifying table schemas) +* **Daft:** querying tables (e.g. previewing tables, data ETL and analysis) + +In database terms, PyIceberg is the Data Description Language (DDL) for database administration and Daft is the Data Manipulation Language (DML) for querying data. + +## Enabling Iceberg support in Daft + +To use Iceberg with Daft, simply ensure that the [PyIceberg](https://py.iceberg.apache.org/) library is also installed in your current Python environment. + +``` +pip install getdaft pyiceberg +``` + +## Querying Iceberg using Daft + +### Reading PyIceberg tables + +Daft interacts natively with [PyIceberg](https://py.iceberg.apache.org/) to read from Iceberg. 
+ +Simply load a PyIceberg table and pass it into Daft as follows: + +``` py +import daft +from pyiceberg import load_catalog + +table = load_catalog("my_catalog").load_table("my_tpch_namespace.lineitem") +df = daft.read_iceberg(table) +df = df.select("L_SHIPDATE", "L_ORDERKEY", "L_COMMENT") +df.show() +``` + +``` +╭────────────┬────────────┬────────────────────────────────╮ +│ L_SHIPDATE ┆ L_ORDERKEY ┆ L_COMMENT │ +│ --- ┆ --- ┆ --- │ +│ Date ┆ Int64 ┆ Utf8 │ +╞════════════╪════════════╪════════════════════════════════╡ +│ 1992-01-02 ┆ 2186280097 ┆ ions sleep about the si │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 175366628 ┆ gular accoun │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186602151 ┆ blithely even │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3937663654 ┆ ake boldly among the ideas. s… │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186781220 ┆ thely. slyly pending ideas ar… │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3937999493 ┆ haggle at the regular, pen │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 2186933061 ┆ ickly. slyly │ +├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ +│ 1992-01-02 ┆ 3938167204 ┆ carefully silent instructions… │ +╰────────────┴────────────┴────────────────────────────────╯ + +(Showing first 8 rows) +``` + +Any subsequent filter operations on the Daft `df` DataFrame object will be correctly optimized to take advantage of Iceberg features such as hidden partitioning and file-level statistics for efficient reads. 
+ +``` py +import datetime + +# Filter which takes advantage of partition pruning capabilities of Iceberg +df = df.where(df["L_SHIPDATE"] > datetime.date(1993, 1, 1)) +df.show() Review Comment: Try to update this with a date field...I was trying to materialize the snapshot table and I had just found the from_pylist method but not sure if it will work with a custom object since it is a List of Dictionaries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
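For context on the last comment: the date filter in that hunk is meant to show reads skipping data that cannot match, using partition information and file-level statistics. Here is a minimal toy sketch of the pruning idea in plain Python, assuming each data file records min/max bounds for `L_SHIPDATE`. `DataFile`, `prune_files`, and the file names are hypothetical illustrations, and this is not the code path Daft or PyIceberg actually uses.

```python
import datetime
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry with column bounds."""

    path: str
    min_shipdate: datetime.date  # lower bound of L_SHIPDATE in this file
    max_shipdate: datetime.date  # upper bound of L_SHIPDATE in this file


def prune_files(files, cutoff):
    """Keep only files that may contain rows with L_SHIPDATE > cutoff.

    A file whose upper bound is <= cutoff cannot contain a matching row,
    so it is dropped without being opened.
    """
    return [f for f in files if f.max_shipdate > cutoff]


# Made-up file list, one file per year, purely for illustration.
files = [
    DataFile("part-1992.parquet", datetime.date(1992, 1, 2), datetime.date(1992, 12, 31)),
    DataFile("part-1993.parquet", datetime.date(1993, 1, 1), datetime.date(1993, 12, 31)),
    DataFile("part-1994.parquet", datetime.date(1994, 1, 1), datetime.date(1994, 12, 31)),
]

# Same predicate as the hunk: L_SHIPDATE > 1993-01-01.
kept = prune_files(files, datetime.date(1993, 1, 1))
print([f.path for f in kept])
```

With these made-up bounds, the 1992 file is skipped and the 1993 and 1994 files are kept. The rows inside the kept files still need the row-level filter, since the bounds only say a file might match.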