kevinjqliu commented on code in PR #1614: URL: https://github.com/apache/iceberg-python/pull/1614#discussion_r1947932757
##########
mkdocs/docs/api.md:
##########
@@ -1533,3 +1533,111 @@ df.show(2)
 (Showing first 2 rows)
 ```
+
+### Polars
+
+PyIceberg interfaces closely with Polars Dataframes and LazyFrame which provides a full lazily optimized query engine interface on top of PyIceberg tables.
+
+<!-- prettier-ignore-start -->
+
+!!! note "Requirements"
+    This requires [`polars` to be installed](index.md).
+
+<!-- prettier-ignore-end -->
+
+#### Working with Polars DataFrame
+
+PyIceberg makes it easy to filter out data from a huge table and pull it into a Polars dataframe locally. This will only fetch the relevant Parquet files for the query and apply the filter. This will reduce IO and therefore improve performance and reduce cost.
+
+```python
+schema = Schema(
+    NestedField(field_id=1, name='ticket_id', field_type=LongType(), required=True),
+    NestedField(field_id=2, name='customer_id', field_type=LongType(), required=True),
+    NestedField(field_id=3, name='issue', field_type=StringType(), required=False),
+    NestedField(field_id=4, name='created_at', field_type=TimestampType(), required=True),
+    required=True
+)
+
+iceberg_table = catalog.create_table(
+    identifier='default.product_support_issues',
+    schema=schema
+)
+
+pa_table_data = pa.Table.from_pylist(
+[
+    {'ticket_id': 1, 'customer_id': 546, 'issue': 'User Login issue', 'created_at': 1650020000000000},
+    {'ticket_id': 2, 'customer_id': 547, 'issue': 'Payment not going through', 'created_at': 1650028640000000},
+    ...
Review Comment:
nit: remove this so the examples can be easily copy/pasted

##########
pyproject.toml:
##########
@@ -80,6 +80,7 @@ sqlalchemy = { version = "^2.0.18", optional = true }
 getdaft = { version = ">=0.2.12", optional = true }
 cachetools = "^5.5.0"
 pyiceberg-core = { version = "^0.4.0", optional = true }
+polars = "^1.21.0"
Review Comment:
This should be optional.

```suggestion
polars = { version = "^1.21.0", optional = true }
```

##########
pyiceberg/table/__init__.py:
##########
@@ -1624,6 +1628,19 @@ def to_ray(self) -> ray.data.dataset.Dataset:
         return ray.data.from_arrow(self.to_arrow())

+    def to_polars(self) -> pl.DataFrame:
Review Comment:
> 'to_polars' is DataScan class method and not a Table in pyiceberg.

Ah yes, thanks!

##########
mkdocs/docs/api.md:
##########
@@ -1533,3 +1533,111 @@ df.show(2)
 (Showing first 2 rows)
 ```
+
+### Polars
+
+PyIceberg interfaces closely with Polars Dataframes and LazyFrame which provides a full lazily optimized query engine interface on top of PyIceberg tables.
+
+<!-- prettier-ignore-start -->
+
+!!! note "Requirements"
+    This requires [`polars` to be installed](index.md).
Review Comment:
I know the other sections also do the same, but we should mention `pyiceberg['polars']` as a way to install this dep.

##########
mkdocs/docs/api.md:
##########
@@ -1533,3 +1533,111 @@ df.show(2)
 (Showing first 2 rows)
 ```
+
+### Polars
+
+PyIceberg interfaces closely with Polars Dataframes and LazyFrame which provides a full lazily optimized query engine interface on top of PyIceberg tables.
+
+<!-- prettier-ignore-start -->
+
+!!! note "Requirements"
+    This requires [`polars` to be installed](index.md).
+
+<!-- prettier-ignore-end -->
+
+#### Working with Polars DataFrame
+
+PyIceberg makes it easy to filter out data from a huge table and pull it into a Polars dataframe locally. This will only fetch the relevant Parquet files for the query and apply the filter. This will reduce IO and therefore improve performance and reduce cost.
+
+```python
Review Comment:
nit: can you run this Python code through a code formatter?

##########
pyiceberg/table/__init__.py:
##########
@@ -1624,6 +1638,19 @@ def to_ray(self) -> ray.data.dataset.Dataset:
         return ray.data.from_arrow(self.to_arrow())

+    def to_polars(self) -> pl.DataFrame:
+        """Read a Polars DataFrame from this Iceberg table.
+
+        Returns:
+            pl.DataFrame: Materialized Polars Dataframe from the Iceberg table
+        """
+        import polars as pl
+
+        result = pl.from_arrow(self.to_arrow())
+        if isinstance(result, pl.Series):
+            result = result.to_frame()
Review Comment:
nit: should we be opinionated about this here, or just return the same signature as `.to_arrow()`?
https://docs.pola.rs/api/python/dev/reference/api/polars.from_arrow.html#

##########
mkdocs/docs/api.md:
##########
@@ -1533,3 +1533,111 @@ df.show(2)
 (Showing first 2 rows)
 ```
+
+### Polars
Review Comment:
Thanks for the docs! This is great. The only thing I would add is to also mention the difference between `iceberg_table.to_polars()` and `iceberg_table.scan().to_polars()`.
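
On the `optional = true` suggestion for `pyproject.toml`: the function-local `import polars as pl` in `to_polars` already fits an optional extra, since the import only runs when the method is called. A minimal standard-library sketch of that lazy-import pattern follows. The helper name `require_optional` and the `pyiceberg[polars]` install hint are assumptions for illustration, not PyIceberg code.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency lazily, or fail with an install hint.

    Hypothetical helper for illustration; not part of PyIceberg.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # The extra name in the hint is an assumption about how the
        # optional dependency would be exposed to users.
        raise ModuleNotFoundError(
            f"{module_name!r} is required for this feature; "
            f"install it with: pip install 'pyiceberg[{extra}]'"
        ) from exc


# A stdlib module stands in for an optional dependency that is installed.
json_module = require_optional("json", extra="json")

# A made-up module name stands in for an optional dependency that is missing.
try:
    require_optional("surely_not_installed_module_xyz", extra="polars")
except ModuleNotFoundError as err:
    message = str(err)
```

With this shape, importing the package never fails when the extra is absent; only the call that needs it raises, and the error names the extra to install.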