Re: [PR] Added support for Polars DataFrame and LazyFrame [iceberg-python]

via GitHub Sat, 08 Feb 2025 11:21:09 -0800


kevinjqliu commented on code in PR #1614:
URL: https://github.com/apache/iceberg-python/pull/1614#discussion_r1947932757



##########
mkdocs/docs/api.md:
##########
@@ -1533,3 +1533,111 @@ df.show(2)
 
 (Showing first 2 rows)
 ```
+
+### Polars
+
+PyIceberg interfaces closely with Polars Dataframes and LazyFrame which provides a full lazily optimized query engine interface on top of PyIceberg tables.
+
+<!-- prettier-ignore-start -->
+
+!!! note "Requirements"
+    This requires [`polars` to be installed](index.md).
+
+<!-- prettier-ignore-end -->
+
+#### Working with Polars DataFrame
+
+PyIceberg makes it easy to filter out data from a huge table and pull it into a Polars dataframe locally. This will only fetch the relevant Parquet files for the query and apply the filter. This will reduce IO and therefore improve performance and reduce cost.
+
+```python
+schema = Schema(
+    NestedField(field_id=1, name='ticket_id', field_type=LongType(), required=True),
+    NestedField(field_id=2, name='customer_id', field_type=LongType(), required=True),
+    NestedField(field_id=3, name='issue', field_type=StringType(), required=False),
+    NestedField(field_id=4, name='created_at', field_type=TimestampType(), required=True),
+  required=True
+)
+
+iceberg_table = catalog.create_table(
+    identifier='default.product_support_issues',
+    schema=schema
+)
+
+pa_table_data = pa.Table.from_pylist(
+[
+    {'ticket_id': 1, 'customer_id': 546, 'issue': 'User Login issue', 'created_at': 1650020000000000},
+    {'ticket_id': 2, 'customer_id': 547, 'issue': 'Payment not going through', 'created_at': 1650028640000000},
+    ...

Review Comment:
   nit: remove this `...` line so the example can be copy/pasted easily



##########
pyproject.toml:
##########
@@ -80,6 +80,7 @@ sqlalchemy = { version = "^2.0.18", optional = true }
 getdaft = { version = ">=0.2.12", optional = true }
 cachetools = "^5.5.0"
 pyiceberg-core = { version = "^0.4.0", optional = true }
+polars = "^1.21.0"

Review Comment:
   this should be optional
   ```suggestion
   polars = { version = "^1.21.0", optional = true }
   ```



##########
pyiceberg/table/__init__.py:
##########
@@ -1624,6 +1628,19 @@ def to_ray(self) -> ray.data.dataset.Dataset:
 
         return ray.data.from_arrow(self.to_arrow())
 
+    def to_polars(self) -> pl.DataFrame:

Review Comment:
   > `to_polars` is a `DataScan` class method, not a `Table` method, in pyiceberg.
   
   Ah yes, thanks!



##########
mkdocs/docs/api.md:
##########
@@ -1533,3 +1533,111 @@ df.show(2)
 
 (Showing first 2 rows)
 ```
+
+### Polars
+
+PyIceberg interfaces closely with Polars Dataframes and LazyFrame which provides a full lazily optimized query engine interface on top of PyIceberg tables.
+
+<!-- prettier-ignore-start -->
+
+!!! note "Requirements"
+    This requires [`polars` to be installed](index.md).

Review Comment:
   I know the other sections do the same, but we should mention 
`pyiceberg[polars]` as a way to install this dependency



##########
mkdocs/docs/api.md:
##########
@@ -1533,3 +1533,111 @@ df.show(2)
 
 (Showing first 2 rows)
 ```
+
+### Polars
+
+PyIceberg interfaces closely with Polars Dataframes and LazyFrame which provides a full lazily optimized query engine interface on top of PyIceberg tables.
+
+<!-- prettier-ignore-start -->
+
+!!! note "Requirements"
+    This requires [`polars` to be installed](index.md).
+
+<!-- prettier-ignore-end -->
+
+#### Working with Polars DataFrame
+
+PyIceberg makes it easy to filter out data from a huge table and pull it into a Polars dataframe locally. This will only fetch the relevant Parquet files for the query and apply the filter. This will reduce IO and therefore improve performance and reduce cost.
+
+```python

Review Comment:
   nit: can you run this Python code through a code formatter?



##########
pyiceberg/table/__init__.py:
##########
@@ -1624,6 +1638,19 @@ def to_ray(self) -> ray.data.dataset.Dataset:
 
         return ray.data.from_arrow(self.to_arrow())
 
+    def to_polars(self) -> pl.DataFrame:
+        """Read a Polars DataFrame from this Iceberg table.
+
+        Returns:
+            pl.DataFrame: Materialized Polars Dataframe from the Iceberg table
+        """
+        import polars as pl
+
+        result = pl.from_arrow(self.to_arrow())
+        if isinstance(result, pl.Series):
+            result = result.to_frame()

Review Comment:
   nit: should we be opinionated about this here, or just return the same 
signature as `.to_arrow()`?
   https://docs.pola.rs/api/python/dev/reference/api/polars.from_arrow.html#



##########
mkdocs/docs/api.md:
##########
@@ -1533,3 +1533,111 @@ df.show(2)
 
 (Showing first 2 rows)
 ```
+
+### Polars

Review Comment:
   Thanks for the docs! This is great. The only thing I would add is a note on 
the difference between
   `iceberg_table.to_polars()` and `iceberg_table.scan().to_polars()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

