Fokko commented on issue #7067:
URL: https://github.com/apache/iceberg/issues/7067#issuecomment-1665449121

   I agree with you there, but it happens after the filtering, so PyIceberg 
will already prune the unrelated files, and filter out the unrelated data. 
   
   The problem with `to_pyarrow_dataset` is that Iceberg has much more 
sophisticated pruning, that can happen at different levels, and this cannot be 
expressed in arrow fragments. We're looking into adding substrait integration 
for PyIceberg, where we could express this, but this is further along.
   
   With [Iceberg's hidden 
partitioning](https://iceberg.apache.org/docs/latest/partitioning/), we don't 
have to do things like:
   ```
       Use the `pyarrow_options` parameter to read only certain partitions.
   
       >>> pl.scan_delta(  # doctest: +SKIP
       ...     table_path,
       ...     pyarrow_options={"partitions": [("year", "=", "2021")]},
       ... )
   ```
   Which I think is very user-unfriendly, because if you don't pass the 
partition, it will also cause Polars to read too much data.
   
   Since `pl.LazyFrame._scan_python_function(...)` only will be called on an 
action, and that passes in the filter-predicate, I think we're fine. Or am I 
missing something?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to